# Simple flow
A simple call to the Presidio Analyzer.

## 1. PII detection with Presidio

In [24]:
from presidio_analyzer import AnalyzerEngine


# Define the text to analyze
text = "My name is John and my driver's license number is C654-321-123-456."

# Initialize PII analyzer engine
analyzer = AnalyzerEngine()


# Use the analyzer to detect the PII in the text
results = analyzer.analyze(text=text, language='en', entities=["US_DRIVER_LICENSE", "PERSON"])

# Print the PII detection results
print(f"Detected PII: {results}")



Detected PII: [type: PERSON, start: 11, end: 15, score: 0.85, type: US_DRIVER_LICENSE, start: 50, end: 54, score: 0.6499999999999999]


Each result contains the entity type (for example, PERSON), the start and end positions of the match in the text, and a confidence score. You can print the results in a more readable form:

In [6]:
print("Identified the following PII:")
for result in results:
    print(f"- {text[result.start:result.end]} as {result.entity_type}")

Identified the following PII:
- John as PERSON
- C654 as US_DRIVER_LICENSE


## 2. PII Anonymizer with Presidio

In [26]:
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

# Initialize PII anonymizer engine
anonymizer = AnonymizerEngine()

# Print the PII detection results from previous step
print(f"Detected PII: {results}")

Detected PII: [type: PERSON, start: 11, end: 15, score: 0.85, type: US_DRIVER_LICENSE, start: 50, end: 54, score: 0.6499999999999999]


Method 1: Replace the PII entities with another value

In [28]:
# Use the anonymizer engine to replace the PII in the text with a placeholder
anonymized_results = anonymizer.anonymize(text=text,
                                          analyzer_results=results,
                                          operators={"PERSON": OperatorConfig("replace", {"new_value": "<ANONYMIZED>"}),
                                                     "US_DRIVER_LICENSE": OperatorConfig("replace", {"new_value": "<ANONYMIZED>"})})
print(anonymized_results)

text: My name is <ANONYMIZED> and my driver's license number is <ANONYMIZED>-321-123-456.
items:
[
    {'start': 58, 'end': 70, 'entity_type': 'US_DRIVER_LICENSE', 'text': '<ANONYMIZED>', 'operator': 'replace'},
    {'start': 11, 'end': 23, 'entity_type': 'PERSON', 'text': '<ANONYMIZED>', 'operator': 'replace'}
]



Method 2: Encrypt the PII entities with a key

In [29]:
# Use the anonymizer engine to encrypt the PII in the text
# Both entity types are encrypted with the same key
anonymized_results = anonymizer.anonymize(text=text, 
                                          analyzer_results= results,
                                          operators={"PERSON": OperatorConfig("encrypt", {"key": "WmZq4t7w!z%C&F)J"}),
                                                    "US_DRIVER_LICENSE": OperatorConfig("encrypt", {"key": "WmZq4t7w!z%C&F)J"})})
print(anonymized_results)

text: My name is H+gA6XPMOISwvix0j+fJCeqvR7P7covhLYDNOF9tic0= and my driver's license number is Xce8Djo1mpTGJJvlU6tb1FS6anbMMpe1507bg8CItT4=-321-123-456.
items:
[
    {'start': 90, 'end': 134, 'entity_type': 'US_DRIVER_LICENSE', 'text': 'Xce8Djo1mpTGJJvlU6tb1FS6anbMMpe1507bg8CItT4=', 'operator': 'encrypt'},
    {'start': 11, 'end': 55, 'entity_type': 'PERSON', 'text': 'H+gA6XPMOISwvix0j+fJCeqvR7P7covhLYDNOF9tic0=', 'operator': 'encrypt'}
]
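
Note that the items in the output above are listed from the end of the text toward the start, and that their offsets refer to the anonymized text rather than the original. One reason a back-to-front order is convenient: if spans are rewritten right to left, the offsets of the spans that have not been processed yet remain valid. The following is a minimal sketch of that idea in plain Python. It is not Presidio's implementation, and `Span` and `replace_spans` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Hypothetical stand-in for an analyzer result: a typed character range."""
    entity_type: str
    start: int
    end: int


def replace_spans(text: str, spans: List[Span], new_value: str) -> str:
    """Replace every span with new_value, working from the end of the text backward."""
    # Handling the right-most span first means the offsets of the spans
    # further to the left are still correct when we get to them.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + new_value + text[span.end:]
    return text


text = "My name is John and my driver's license number is C654-321-123-456."
# Offsets copied from the detection results printed earlier
spans = [Span("PERSON", 11, 15), Span("US_DRIVER_LICENSE", 50, 54)]

masked = replace_spans(text, spans, "<ANONYMIZED>")
print(masked)
```

Processing the spans left to right instead would require adjusting every later offset by the length difference of each replacement.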



Decrypt back to the original text if needed

In [30]:
from presidio_anonymizer import DeanonymizeEngine
from presidio_anonymizer.entities import OperatorResult, OperatorConfig

# Initialize the engine:
engine = DeanonymizeEngine()

# Call deanonymize with the anonymized text, the anonymizer results, and
# the operators that define the deanonymization type.
result = engine.deanonymize(
    text="My name is S184CMt9Drj7QaKQ21JTrpYzghnboTF9pn/neN8JME0=",
    entities=[
        OperatorResult(start=11, end=55, entity_type="PERSON"),
    ],
    operators={"DEFAULT": OperatorConfig("decrypt", {"key": "WmZq4t7w!z%C&F)J"})},
)

print(result)

text: My name is Chloë
items:
[
    {'start': 11, 'end': 16, 'entity_type': 'PERSON', 'text': 'Chloë', 'operator': 'decrypt'}
]



# Customize Presidio

Next, we'll go over ways to customize Presidio for specific needs by adding PII recognizers, using context words, NER models, and more.
Presidio currently supports:
- Deny-list based PII recognition: for example, in the title detection scenario, you might have a predefined list of person titles (Sir, Mr., Mrs., etc.) that you want to remove from your text.
- Regular-expression based PII recognition: in some situations, you might want to define custom entities with regular expressions. For example, a company's user identifiers might start with a predefined prefix followed by digits.
- Additional models and languages: Presidio currently uses spaCy and Stanza.
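
To make the user-identifier case concrete, here is a small sketch that checks such a pattern with Python's standard `re` module before it is handed to any recognizer. The format (the prefix `EMP-` followed by six digits) and the sample sentence are invented for illustration and do not come from a real system.

```python
import re

# Hypothetical format: the fixed prefix "EMP-" followed by exactly six digits.
USER_ID_REGEX = re.compile(r"\bEMP-\d{6}\b")

sample = "Ticket opened by EMP-004217 and escalated to EMP-118830."

# Collect each match together with its character offsets
matches = [(m.group(), m.start(), m.end()) for m in USER_ID_REGEX.finditer(sample)]
for value, start, end in matches:
    print(f"- {value} at [{start}, {end})")
```

Trying the expression on a few sample sentences first makes it easier to spot an overly broad or overly strict pattern.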

# Deny-list based PII recognition
In this example, we will pass a short list of tokens which should be marked as PII if detected.

First, let's define the tokens we want to treat as PII. In this example, it is a list of titles.

In [2]:
titles_list = [
    "Sir",
    "Ma'am",
    "Madam",
    "Mr.",
    "Mrs.",
    "Ms.",
    "Miss",
    "Dr.",
    "Professor",
]

Second, let's create a PatternRecognizer that scans for those titles by passing it a deny_list.

In [5]:
from presidio_analyzer import PatternRecognizer
titles_recognizer = PatternRecognizer(supported_entity="TITLE", deny_list=titles_list)

# Call the new recognizer directly on a sample text
text = "Hello, Mr. John Doe"
result = titles_recognizer.analyze(text, entities=["TITLE"])
print(result)

[type: TITLE, start: 7, end: 10, score: 1.0]
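
Conceptually, matching a deny list amounts to searching for any of its terms as a whole token. The sketch below illustrates the idea with the standard `re` module only. It is an assumption about how such matching can be done, not a description of Presidio's internals, and `build_deny_list_regex` is a hypothetical helper. Unlike a real recognizer, it is case-sensitive and returns plain tuples.

```python
import re
from typing import List, Tuple


def build_deny_list_regex(terms: List[str]) -> "re.Pattern":
    """Compile a deny list into one alternation that matches whole terms only."""
    # Escape each term so characters such as "." are matched literally, and
    # try longer terms first so a longer term wins over a shorter shared prefix.
    escaped = [re.escape(t) for t in sorted(terms, key=len, reverse=True)]
    # Lookarounds are used instead of \b because \b does not mark a boundary
    # between "." and a following space.
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)")


def find_terms(text: str, terms: List[str]) -> List[Tuple[str, int, int]]:
    """Return (matched text, start, end) for every deny-list hit."""
    regex = build_deny_list_regex(terms)
    return [(m.group(), m.start(), m.end()) for m in regex.finditer(text)]


titles = ["Sir", "Ma'am", "Madam", "Mr.", "Mrs.", "Ms.", "Miss", "Dr.", "Professor"]
hits = find_terms("Hello, Mr. John Doe", titles)
print(hits)
```

The single hit covers the same character range, 7 to 10, as the recognizer output above.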


Now, if you want to detect both the title and the person's name in this text, you need to add titles_recognizer to the AnalyzerEngine.

In [20]:
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()
# Add titles_recognizer to the analyzer
analyzer.registry.add_recognizer(titles_recognizer)
# Now test the analyzer with the new recognizer
text = "Hello, Mr. John Doe"
results = analyzer.analyze(text, language="en", entities=["TITLE", "PERSON"])
# print(results)
print("Identified the following PII:")
for result in results:
    print(f"- {text[result.start:result.end]} as {result.entity_type}")


Identified the following PII:
- Mr. as TITLE
- John Doe as PERSON


# Regular-expressions based PII recognition
In this example, we'll show a simple way to add a recognizer based on a regular expression. Let's assume we want to be extremely conservative and treat any token that contains a number as PII.

In [22]:
from presidio_analyzer import Pattern, PatternRecognizer

# Define the regex pattern in a Presidio Pattern object
number_pattern = Pattern(name="numbers_pattern", regex=r"\d+", score=0.5)

# Define a PatternRecognizer with number_pattern
number_recognizer = PatternRecognizer(supported_entity="NUMBER", patterns=[number_pattern])

# Test the recognizer
text = "My phone number is 555-1234"
results = number_recognizer.analyze(text, entities=["NUMBER"])
print("Identified the following PII:")
for result in results:
    print(f"- {text[result.start:result.end]} as {result.entity_type}")

Identified the following PII:
- 555 as NUMBER
- 1234 as NUMBER


Keep in mind that a newly added recognizer can introduce errors, both false positives and false negatives, which affect the overall performance of Presidio. Consider testing each recognizer on a representative dataset.
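
One way to run such a test is to compare the spans a pattern finds with hand-labeled spans and then compute precision and recall. The sketch below does this for the `\d+` pattern using only the standard library. The three labeled sentences are invented for illustration, and exact span matching is an assumed evaluation rule. A real evaluation would use a much larger, representative dataset.

```python
import re
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets


def predict_spans(text: str, regex: str) -> Set[Span]:
    """Spans found by the regular expression in the text."""
    return {(m.start(), m.end()) for m in re.finditer(regex, text)}


def precision_recall(dataset: List[Tuple[str, Set[Span]]], regex: str) -> Tuple[float, float]:
    """Span-level precision and recall, counting only exact span matches."""
    tp = fp = fn = 0
    for text, gold in dataset:
        predicted = predict_spans(text, regex)
        tp += len(predicted & gold)   # found and labeled
        fp += len(predicted - gold)   # found but not labeled (false positive)
        fn += len(gold - predicted)   # labeled but not found (false negative)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Tiny hand-labeled dataset, invented for illustration
dataset = [
    ("My phone number is 555-1234", {(19, 22), (23, 27)}),
    ("Meet me at gate B12 at 9", {(23, 24)}),       # "12" in "B12" is not labeled
    ("Call extension five", {(15, 19)}),            # a number written as a word
]

precision, recall = precision_recall(dataset, r"\d+")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here the gate number counts as a false positive and the spelled-out number as a false negative, which are exactly the two kinds of error mentioned above.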

# Rule based logic recognizer
Taking the numbers recognizer above one step further, let's say we would also like to detect numbers written as words, for example "Number One". We can leverage the underlying spaCy token attributes, or write our own logic, to detect entities that cannot be captured with a regular expression or a deny-list.

- In this example, we create a new class that extends EntityRecognizer, the base recognizer in Presidio. This abstract class requires us to implement the load and analyze methods.
- Each custom recognizer accepts an object of type NlpArtifacts, which holds pre-computed attributes of the input text.

In [19]:
from typing import List
from presidio_analyzer import EntityRecognizer, RecognizerResult, AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpArtifacts

# New recognizer class
class MyNumbersRecognizer(EntityRecognizer):
    expected_confidence_level = 0.7
    def load(self) -> None:
        pass

    def analyze(
            self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]:
        # Collect the results in a local list rather than relying on a global one
        results = []
        # Iterate over the spaCy tokens, and call token.like_num
        for token in nlp_artifacts.tokens:
            if token.like_num:
                result = RecognizerResult(
                        entity_type="NUMBER",
                        start=token.idx,
                        end=token.idx + len(token),
                        score=self.expected_confidence_level
                )
                results.append(result)
        return results
    
# Now create an instance of MyNumbersRecognizer that supports the NUMBER entity
new_numbers_recognizer = MyNumbersRecognizer(supported_entities = ["NUMBER"])
# Create an instance of the analyzer engine
analyzer = AnalyzerEngine()
# Add the new number recognizer to the analyzer
analyzer.registry.add_recognizer(new_numbers_recognizer)

# Test the analyzer with the new recognizer
text = "My name is Harry and my phone number is Five Five Five One Two Three Four"
results = analyzer.analyze(text, language="en")
print("Identified the following PII:")
for result in results:
    print(f"- {text[result.start:result.end]} as {result.entity_type}")

Identified the following PII:
- Harry as PERSON
- Five as NUMBER
- Five as NUMBER
- Five as NUMBER
- One as NUMBER
- Two as NUMBER
- Three as NUMBER
- Four as NUMBER
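
In the output above, each number word is reported as a separate NUMBER entity, although together they form one phone number. If a single span is preferred, adjacent results of the same type can be merged in a post-processing step. The sketch below shows one possible approach in plain Python. `merge_adjacent` is a hypothetical helper, and the spans are plain tuples standing in for the analyzer results. The NUMBER offsets are located with a regular expression here only to keep the example self-contained.

```python
import re
from typing import List, Tuple

Span = Tuple[str, int, int]  # (entity_type, start, end)


def merge_adjacent(text: str, spans: List[Span]) -> List[Span]:
    """Merge same-type spans that are separated only by whitespace."""
    merged: List[Span] = []
    for entity_type, start, end in sorted(spans, key=lambda s: s[1]):
        if merged:
            prev_type, prev_start, prev_end = merged[-1]
            gap = text[prev_end:start]
            # An empty or whitespace-only gap means the two spans touch
            if prev_type == entity_type and gap.strip() == "":
                merged[-1] = (prev_type, prev_start, max(prev_end, end))
                continue
        merged.append((entity_type, start, end))
    return merged


text = "My name is Harry and my phone number is Five Five Five One Two Three Four"

# Stand-ins for the analyzer results: one PERSON span plus one span per number word
spans: List[Span] = [("PERSON", 11, 16)]
spans += [("NUMBER", m.start(), m.end())
          for m in re.finditer(r"\b(?:One|Two|Three|Four|Five)\b", text)]

print("Identified the following PII:")
for entity_type, start, end in merge_adjacent(text, spans):
    print(f"- {text[start:end]} as {entity_type}")
```

Whether merging is appropriate depends on the use case, since two unrelated numbers that happen to be adjacent would also be joined.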
