# Use Presidio Anonymizer for Pseudonymization of PII data

Pseudonymization is a data management and de-identification procedure by which personally identifiable information fields within a data record are replaced by one or more artificial identifiers, or pseudonyms.

In this notebook, we'll show an example of how to use the Presidio Anonymizer library to pseudonymize PII data. In this example, we will replace each value with a unique identifier (e.g. `<PERSON_14>`). Then, we'll de-anonymize the data by replacing the unique identifiers back with their mapped PII values.

**Important**: The following logic is _not thread-safe_ and may produce incorrect results if run concurrently in a multi-threaded environment, since the mapping has to be shared between threads/workers/processes.

## Install Dependencies

First, let's install the required packages.

In [4]:
# Download presidio
!pip install presidio_analyzer presidio_anonymizer
!python -m spacy download en_core_web_lg


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.2[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Collecting en-core-web-lg==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.2[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_lg')


## Import Required Libraries

In [5]:
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine, DeanonymizeEngine, OperatorConfig
from presidio_anonymizer.operators import Operator, OperatorType

from typing import Dict
from pprint import pprint

## 1. Using the `AnalyzerEngine` to identify PII in a text

The AnalyzerEngine scans the text and identifies Personally Identifiable Information (PII) entities such as names, locations, emails, phone numbers, etc.

In [6]:
text = "Peter gave his book to Heidi which later gave it to Nicole. Peter lives in London and Nicole lives in Tashkent."
print("Original text:")
pprint(text)

analyzer = AnalyzerEngine()
analyzer_results = analyzer.analyze(text=text, language="en")

print("\nAnalyzer results:")
pprint(analyzer_results)

Original text:
('Peter gave his book to Heidi which later gave it to Nicole. Peter lives in '
 'London and Nicole lives in Tashkent.')



Analyzer results:
[type: PERSON, start: 0, end: 5, score: 0.85,
 type: PERSON, start: 23, end: 28, score: 0.85,
 type: PERSON, start: 52, end: 58, score: 0.85,
 type: PERSON, start: 60, end: 65, score: 0.85,
 type: LOCATION, start: 75, end: 81, score: 0.85,
 type: PERSON, start: 86, end: 92, score: 0.85,
 type: LOCATION, start: 102, end: 110, score: 0.85]


## 2. Creating a custom Anonymizer (called Operator) which replaces each text with a unique identifier

To create a custom anonymizer, we need to create a class that inherits from `Operator` and implement the `operate` method. This method receives the original text and a dictionary called `params` with the configuration defined by the user.

The `entity_mapping` is a dictionary that maps each entity value to a unique identifier, for each entity type.

In [7]:
class InstanceCounterAnonymizer(Operator):
    """
    Anonymizer which replaces the entity value
    with an instance counter per entity.
    """

    REPLACING_FORMAT = "<{entity_type}_{index}>"

    def operate(self, text: str, params: Dict = None) -> str:
        """Anonymize the input text."""

        entity_type: str = params["entity_type"]

        # entity_mapping is a dict of dicts containing mappings per entity type
        entity_mapping: Dict[Dict:str] = params["entity_mapping"]

        entity_mapping_for_type = entity_mapping.get(entity_type)
        if not entity_mapping_for_type:
            new_text = self.REPLACING_FORMAT.format(
                entity_type=entity_type, index=0
            )
            entity_mapping[entity_type] = {}

        else:
            if text in entity_mapping_for_type:
                return entity_mapping_for_type[text]

            previous_index = self._get_last_index(entity_mapping_for_type)
            new_text = self.REPLACING_FORMAT.format(
                entity_type=entity_type, index=previous_index + 1
            )

        entity_mapping[entity_type][text] = new_text
        return new_text

    @staticmethod
    def _get_last_index(entity_mapping_for_type: Dict) -> int:
        """Get the last index for a given entity type."""
        return len(entity_mapping_for_type)

    def validate(self, params: Dict = None) -> None:
        """Validate operator parameters."""

        if "entity_mapping" not in params:
            raise ValueError("An input Dict called `entity_mapping` is required.")
        if "entity_type" not in params:
            raise ValueError("An entity_type param is required.")

    def operator_name(self) -> str:
        return "entity_counter"

    def operator_type(self) -> OperatorType:
        return OperatorType.Anonymize

## 3. Passing the new operator to the `AnonymizerEngine` and use it to anonymize the text

Now we create an instance of the AnonymizerEngine, add our custom anonymizer, and use it to anonymize the text.

In [8]:
# Create Anonymizer engine and add the custom anonymizer
anonymizer_engine = AnonymizerEngine()
anonymizer_engine.add_anonymizer(InstanceCounterAnonymizer)

# Create a mapping between entity types and counters
entity_mapping = dict()

# Anonymize the text
anonymized_result = anonymizer_engine.anonymize(
    text,
    analyzer_results,
    {
        "DEFAULT": OperatorConfig(
            "entity_counter", {"entity_mapping": entity_mapping}
        )
    },
)

print("Anonymized text:")
pprint(anonymized_result.text)

Anonymized text:
('<PERSON_2> gave his book to <PERSON_3> which later gave it to <PERSON_0>. '
 '<PERSON_2> lives in <LOCATION_2> and <PERSON_0> lives in <LOCATION_0>.')


## Examining the Entity Mapping

Let's look at the entity mapping that was created during anonymization. This mapping will be used later for de-anonymization.

In [9]:
print("Entity Mapping:")
pprint(entity_mapping, indent=2)

Entity Mapping:
{ 'LOCATION': {'London': '<LOCATION_2>', 'Tashkent': '<LOCATION_0>'},
  'PERSON': { 'Heidi': '<PERSON_3>',
              'Nicole': '<PERSON_0>',
              'Peter': '<PERSON_2>'}}


## 4. De-anonymizing the text using the entity_mapping

Similar to the anonymization operator, we create a custom de-anonymization operator. This operator will replace the unique identifiers with the original values.

In [10]:
class InstanceCounterDeanonymizer(Operator):
    """
    Deanonymizer which replaces the unique identifier
    with the original text.
    """

    def operate(self, text: str, params: Dict = None) -> str:
        """De-anonymize the input text."""

        entity_type: str = params["entity_type"]

        # entity_mapping is a dict of dicts containing mappings per entity type
        entity_mapping: Dict[Dict:str] = params["entity_mapping"]

        if entity_type not in entity_mapping:
            raise ValueError(f"Entity type {entity_type} not found in entity mapping!")
        if text not in entity_mapping[entity_type].values():
            raise ValueError(f"Text {text} not found in entity mapping for entity type {entity_type}!")

        return self._find_key_by_value(entity_mapping[entity_type], text)

    @staticmethod
    def _find_key_by_value(entity_mapping, value):
        for key, val in entity_mapping.items():
            if val == value:
                return key
        return None

    def validate(self, params: Dict = None) -> None:
        """Validate operator parameters."""

        if "entity_mapping" not in params:
            raise ValueError("An input Dict called `entity_mapping` is required.")
        if "entity_type" not in params:
            raise ValueError("An entity_type param is required.")

    def operator_name(self) -> str:
        return "entity_counter_deanonymizer"

    def operator_type(self) -> OperatorType:
        return OperatorType.Deanonymize

## 5. De-anonymizing the text

Now we create a DeanonymizeEngine, add our custom deanonymizer, and use it to restore the original text from the anonymized version.

In [11]:
deanonymizer_engine = DeanonymizeEngine()
deanonymizer_engine.add_deanonymizer(InstanceCounterDeanonymizer)

deanonymized = deanonymizer_engine.deanonymize(
    anonymized_result.text,
    anonymized_result.items,
    {"DEFAULT": OperatorConfig("entity_counter_deanonymizer",
                               params={"entity_mapping": entity_mapping})}
)

print("Original text:")
pprint(text)

print("\nAnonymized text:")
pprint(anonymized_result.text)

print("\nDe-anonymized text:")
pprint(deanonymized.text)

print("\nMatch original:", text == deanonymized.text)

Original text:
('Peter gave his book to Heidi which later gave it to Nicole. Peter lives in '
 'London and Nicole lives in Tashkent.')

Anonymized text:
('<PERSON_2> gave his book to <PERSON_3> which later gave it to <PERSON_0>. '
 '<PERSON_2> lives in <LOCATION_2> and <PERSON_0> lives in <LOCATION_0>.')

De-anonymized text:
('Peter gave his book to Heidi which later gave it to Nicole. Peter lives in '
 'London and Nicole lives in Tashkent.')

Match original: True


## 6. Testing with a Longer Text Example

Now let's test the pseudonymization process with a longer, more complex text that contains various types of PII data. We'll also track the processing time for each step to understand performance characteristics.


In [12]:
import time

# Create a longer text example with various PII types
long_text = """
Dr. Sarah Johnson, a renowned cardiologist at Johns Hopkins Hospital in Baltimore, Maryland, 
recently published a groundbreaking research paper. She can be reached at sarah.johnson@hopkins.edu 
or by phone at (410) 555-0123. Her colleague, Dr. Michael Chen, who works at Massachusetts General 
Hospital in Boston, collaborated on the study. Dr. Chen's contact information includes his email 
mchen@mgh.harvard.edu and phone number 617-555-0198.

The research involved patients from various locations including New York City, Los Angeles, and 
Chicago. Key participants included Robert Williams (SSN: 123-45-6789), who lives at 1234 Main Street, 
New York, NY 10001, and Jennifer Martinez, residing at 5678 Oak Avenue, Los Angeles, CA 90001. 
Another participant, David Thompson, can be contacted at david.thompson@email.com or (312) 555-0145.

The study was funded by the National Institutes of Health (NIH) and involved collaboration with 
researchers from Stanford University in California and the University of Pennsylvania in Philadelphia. 
Dr. Johnson's assistant, Emily Rodriguez, coordinated the logistics. She can be reached at 
emily.rodriguez@hopkins.edu or (410) 555-0167.

Additional team members included Dr. James Anderson from the Mayo Clinic in Rochester, Minnesota, 
and Dr. Lisa Wang from the Cleveland Clinic in Ohio. The research findings were presented at the 
American Heart Association conference in Dallas, Texas, where Dr. Johnson delivered the keynote address.

For media inquiries, please contact the hospital's public relations department at pr@hopkins.edu 
or call (410) 555-0200. The research paper is available online at https://www.hopkins.edu/research/cardiology2024.
"""

print("=" * 80)
print("LONGER TEXT EXAMPLE - PSEUDONYMIZATION WITH TIMING")
print("=" * 80)
print(f"\nText length: {len(long_text)} characters")
print(f"Text word count: {len(long_text.split())} words")
print("\n" + "-" * 80)

# Initialize engines (reuse from previous cells)
# Note: analyzer, anonymizer_engine, and deanonymizer_engine should already be initialized
# If running this cell independently, uncomment the following lines:
# analyzer = AnalyzerEngine()
# anonymizer_engine = AnonymizerEngine()
# anonymizer_engine.add_anonymizer(InstanceCounterAnonymizer)
# deanonymizer_engine = DeanonymizeEngine()
# deanonymizer_engine.add_deanonymizer(InstanceCounterDeanonymizer)

# Step 1: Analyze the text (identify PII)
print("\n[1] Analyzing text for PII entities...")
start_time = time.time()
analyzer_results_long = analyzer.analyze(text=long_text, language="en")
analysis_time = time.time() - start_time

print(f"   ✓ Analysis completed in {analysis_time:.4f} seconds")
print(f"   ✓ Found {len(analyzer_results_long)} PII entities")
print(f"   ✓ Entity types found: {set([r.entity_type for r in analyzer_results_long])}")

# Step 2: Anonymize the text
print("\n[2] Anonymizing text...")
# Create a fresh entity mapping for this test
entity_mapping_long = dict()

start_time = time.time()
anonymized_result_long = anonymizer_engine.anonymize(
    long_text,
    analyzer_results_long,
    {
        "DEFAULT": OperatorConfig(
            "entity_counter", {"entity_mapping": entity_mapping_long}
        )
    },
)
anonymization_time = time.time() - start_time

print(f"   ✓ Anonymization completed in {anonymization_time:.4f} seconds")
print(f"   ✓ Anonymized text length: {len(anonymized_result_long.text)} characters")

# Step 3: De-anonymize the text
print("\n[3] De-anonymizing text...")
start_time = time.time()
deanonymized_long = deanonymizer_engine.deanonymize(
    anonymized_result_long.text,
    anonymized_result_long.items,
    {"DEFAULT": OperatorConfig("entity_counter_deanonymizer",
                               params={"entity_mapping": entity_mapping_long})}
)
deanonymization_time = time.time() - start_time

print(f"   ✓ De-anonymization completed in {deanonymization_time:.4f} seconds")

# Calculate total time
total_time = analysis_time + anonymization_time + deanonymization_time

# Display timing summary
print("\n" + "=" * 80)
print("TIMING SUMMARY")
print("=" * 80)
print(f"Analysis time:        {analysis_time:.4f} seconds ({analysis_time/total_time*100:.1f}%)")
print(f"Anonymization time:   {anonymization_time:.4f} seconds ({anonymization_time/total_time*100:.1f}%)")
print(f"De-anonymization time: {deanonymization_time:.4f} seconds ({deanonymization_time/total_time*100:.1f}%)")
print(f"Total processing time: {total_time:.4f} seconds")
print(f"Processing speed:      {len(long_text)/total_time:.0f} characters/second")
print("=" * 80)

# Verify correctness
print("\n" + "-" * 80)
print("VERIFICATION")
print("-" * 80)
match_original = long_text.strip() == deanonymized_long.text.strip()
print(f"✓ Original and de-anonymized text match: {match_original}")

if not match_original:
    print("\n⚠ Warning: Text mismatch detected!")
    print(f"Original length: {len(long_text.strip())}")
    print(f"De-anonymized length: {len(deanonymized_long.text.strip())}")
else:
    print("✓ Pseudonymization process completed successfully!")

# Display entity mapping statistics
print("\n" + "-" * 80)
print("ENTITY MAPPING STATISTICS")
print("-" * 80)
for entity_type, mappings in entity_mapping_long.items():
    print(f"{entity_type}: {len(mappings)} unique entities")
    # Show first few examples
    examples = list(mappings.items())[:3]
    for original, pseudonym in examples:
        print(f"  - '{original}' → {pseudonym}")
    if len(mappings) > 3:
        print(f"  ... and {len(mappings) - 3} more")


LONGER TEXT EXAMPLE - PSEUDONYMIZATION WITH TIMING

Text length: 1703 characters
Text word count: 230 words

--------------------------------------------------------------------------------

[1] Analyzing text for PII entities...
   ✓ Analysis completed in 0.1589 seconds
   ✓ Found 51 PII entities
   ✓ Entity types found: {'PERSON', 'EMAIL_ADDRESS', 'PHONE_NUMBER', 'LOCATION', 'DATE_TIME', 'URL'}

[2] Anonymizing text...
   ✓ Anonymization completed in 0.0014 seconds
   ✓ Anonymized text length: 1700 characters

[3] De-anonymizing text...
   ✓ De-anonymization completed in 0.0006 seconds

TIMING SUMMARY
Analysis time:        0.1589 seconds (98.8%)
Anonymization time:   0.0014 seconds (0.9%)
De-anonymization time: 0.0006 seconds (0.4%)
Total processing time: 0.1609 seconds
Processing speed:      10585 characters/second

--------------------------------------------------------------------------------
VERIFICATION
---------------------------------------------------------------------------

In [13]:
# Display a sample comparison of original vs anonymized text
print("=" * 80)
print("SAMPLE TEXT COMPARISON")
print("=" * 80)

print("\n--- ORIGINAL TEXT (first 500 characters) ---")
print(long_text[:500] + "...")

print("\n--- ANONYMIZED TEXT (first 500 characters) ---")
print(anonymized_result_long.text[:500] + "...")

print("\n--- DE-ANONYMIZED TEXT (first 500 characters) ---")
print(deanonymized_long.text[:500] + "...")


SAMPLE TEXT COMPARISON

--- ORIGINAL TEXT (first 500 characters) ---

Dr. Sarah Johnson, a renowned cardiologist at Johns Hopkins Hospital in Baltimore, Maryland, 
recently published a groundbreaking research paper. She can be reached at sarah.johnson@hopkins.edu 
or by phone at (410) 555-0123. Her colleague, Dr. Michael Chen, who works at Massachusetts General 
Hospital in Boston, collaborated on the study. Dr. Chen's contact information includes his email 
mchen@mgh.harvard.edu and phone number 617-555-0198.

The research involved patients from various location...

--- ANONYMIZED TEXT (first 500 characters) ---

Dr. <PERSON_10>, a renowned cardiologist at Johns Hopkins Hospital in <LOCATION_17>, <LOCATION_16>, 
recently published a groundbreaking research paper. She can be reached at <EMAIL_ADDRESS_5> 
or by phone at <PHONE_NUMBER_5>. Her colleague, Dr. <PERSON_9>, who works at Massachusetts General 
Hospital in <LOCATION_15>, collaborated on the study. Dr. <PERSON_8>'s contact infor