# 1. PII Detection and Anonymization with Presidio
## 1.1 Overview
In this notebook, we will focus on the detection and de-identification of Personally Identifiable Information (PII) using Presidio, an open-source tool developed by Microsoft.

PII refers to any information that can be used to identify an individual. Examples of PII include names, social security numbers, email addresses, phone numbers, and more. In the wrong hands, PII can be used for malicious purposes such as identity theft, fraud, and phishing attacks, among others. Therefore, it's crucial to ensure that PII is adequately protected, especially when dealing with large datasets.

Presidio offers a robust framework for recognizing and anonymizing PII across multiple languages and data sources. It uses pre-defined recognizers to identify different types of PII, and it provides several anonymization techniques such as masking, redaction, and replacement to de-identify the data.

In this hands-on lab, we will walk you through the process of using Presidio analyzer and anonymizer engines to analyze a text for PII and anonymize it. 

By the end of this lab, you will have a solid understanding of how to use Presidio for PII detection and de-identification, and you will be equipped with the skills to use this tool in your data privacy and security projects.

## 1.2 Simple flow of PII detection with Presidio
Presidio offers a straightforward process for PII detection, which can be broken down into the following steps:
1. Initialize the AnalyzerEngine: The AnalyzerEngine is a core component of Presidio. It contains a set of predefined recognizers that identify different types of PII.
2. Define a text: Specify the text that you want to analyze for PII.
3. Analyze the text: Call the `analyze` method. It returns a list of `RecognizerResult` objects, each representing one piece of detected PII.

In [36]:
from presidio_analyzer import AnalyzerEngine


# Define the text to analyze
text = "My name is John and I live in France"

# Initialize PII analyzer engine
analyzer = AnalyzerEngine()


# Use the analyzer to detect the PII in the text
results = analyzer.analyze(text=text, language='en', entities=["LOCATION", "PERSON"])

# Print the PII detection results
print(f"Detected PII: {results}")


Detected PII: [type: PERSON, start: 11, end: 15, score: 0.85, type: LOCATION, start: 30, end: 36, score: 0.85]


4. Review the result: Each result carries the `entity_type` (e.g., PERSON), the start and end character positions in the text, and a confidence score. You can format these into a more readable summary.

In [37]:
print("Identified the following PII:")
for result in results:
    print(f"- {text[result.start:result.end]} as {result.entity_type}")

Identified the following PII:
- John as PERSON
- France as LOCATION
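The same formatting step can be extended to drop low-confidence detections and to list the rest in reading order. The following is a minimal sketch that runs without Presidio: `Detection` is a hypothetical stand-in that only mirrors the `entity_type`, `start`, `end`, and `score` attributes seen in the output above, and `summarize` and `min_score` are illustrative names, not part of the library.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    """Hypothetical stand-in mirroring the attributes of an analyzer result."""
    entity_type: str
    start: int
    end: int
    score: float


def summarize(text: str, detections: List[Detection], min_score: float = 0.5) -> List[str]:
    """Keep detections at or above min_score and describe them in text order."""
    kept = [d for d in detections if d.score >= min_score]
    kept.sort(key=lambda d: d.start)
    # start/end are character offsets, so they work directly as a Python slice
    return [
        f"{text[d.start:d.end]} as {d.entity_type} (score {d.score:.2f})"
        for d in kept
    ]


text = "My name is John and I live in France"
detections = [
    Detection("LOCATION", 30, 36, 0.85),
    Detection("PERSON", 11, 15, 0.85),
]
lines = summarize(text, detections)
for line in lines:
    print(f"- {line}")
```

Because the helper only reads those four attributes, the list returned by `analyzer.analyze` should be usable in place of `detections`, assuming the attribute names match the ones printed above.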


## 1.2 PII Anonymizer with Presidio

After detecting PII in the text using Presidio's Analyzer engine, the next step is to anonymize this information. This is where Presidio's Anonymizer engine comes into play. The Anonymizer engine provides several methods to anonymize detected PII, including replacement, redaction, and masking.

Here's a simple flow of PII anonymization with Presidio:

1. **Import the necessary libraries**: In addition to the `AnalyzerEngine`, we also need to import the `AnonymizerEngine` from `presidio_anonymizer` and `OperatorConfig` from `presidio_anonymizer.entities`.
2. **Initialize the Anonymizer engine**: Similar to the Analyzer engine, we need to create an instance of the Anonymizer engine.

In [38]:
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

# Initialize PII anonymizer engine
anonymizer = AnonymizerEngine()

# Print the PII detection results from previous step
print(f"Detected PII: {results}")

Detected PII: [type: PERSON, start: 11, end: 15, score: 0.85, type: LOCATION, start: 30, end: 36, score: 0.85]


3. **Define the anonymization configuration**: The `OperatorConfig` class allows us to specify the anonymization method and its parameters for each entity type. For example, we can use the "replace" operator to replace all detected PII with a specific string.

In [41]:
# Use the anonymizer engine to anonymize the PII in the text
anonymized_results = anonymizer.anonymize(text=text,
                                          analyzer_results=results,
                                          operators={"PERSON": OperatorConfig("replace", {"new_value": "<ANONYMIZED>"}),
                                                     "LOCATION": OperatorConfig("replace", {"new_value": "<ANONYMIZED>"})})
print(anonymized_results)

text: My name is <ANONYMIZED> and I live in <ANONYMIZED>
items:
[
    {'start': 38, 'end': 50, 'entity_type': 'LOCATION', 'text': '<ANONYMIZED>', 'operator': 'replace'},
    {'start': 11, 'end': 23, 'entity_type': 'PERSON', 'text': '<ANONYMIZED>', 'operator': 'replace'}
]
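Note that the `start` and `end` values in the output refer to the anonymized text, not the original one: `France` sat at 30 to 36, but its replacement sits at 38 to 50 because the earlier replacement of `John` made the text 8 characters longer. The following plain-Python sketch reproduces that offset arithmetic. It is an illustration only, not Presidio's implementation; `replace_spans` is a hypothetical helper and it assumes the spans do not overlap.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, entity_type) in the original text


def replace_spans(text: str, spans: List[Span], new_value: str) -> Tuple[str, List[Span]]:
    """Replace each span with new_value and report offsets in the new text."""
    pieces = []
    items = []
    cursor = 0  # position in the original text copied so far
    shift = 0   # how much longer (or shorter) the output is so far
    for start, end, entity_type in sorted(spans):
        pieces.append(text[cursor:start])
        pieces.append(new_value)
        new_start = start + shift
        items.append((new_start, new_start + len(new_value), entity_type))
        shift += len(new_value) - (end - start)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), items


text = "My name is John and I live in France"
spans = [(11, 15, "PERSON"), (30, 36, "LOCATION")]
new_text, items = replace_spans(text, spans, "<ANONYMIZED>")
print(new_text)
print(items)
```

The offsets this sketch computes agree with the `items` shown in the output above, which lists the same two entries in reverse order.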



We can also encrypt the PII entities with a key:

In [42]:
# Use the anonymizer engine to encrypt the PII in the text
# Each entity type gets an "encrypt" operator configured with the same key
anonymized_results = anonymizer.anonymize(text=text,
                                          analyzer_results=results,
                                          operators={"PERSON": OperatorConfig("encrypt", {"key": "WmZq4t7w!z%C&F)J"}),
                                                     "LOCATION": OperatorConfig("encrypt", {"key": "WmZq4t7w!z%C&F)J"})})
print(anonymized_results)

text: My name is TfRw3vE9yGQGVHeBKR3gu1AaY9z9wHrn5ATCYXM8KwQ= and I live in g3hE8t4goo65WrJXTc1ZO+f17Pyzp5/u5BM1iWymWgQ=
items:
[
    {'start': 70, 'end': 114, 'entity_type': 'LOCATION', 'text': 'g3hE8t4goo65WrJXTc1ZO+f17Pyzp5/u5BM1iWymWgQ=', 'operator': 'encrypt'},
    {'start': 11, 'end': 55, 'entity_type': 'PERSON', 'text': 'TfRw3vE9yGQGVHeBKR3gu1AaY9z9wHrn5ATCYXM8KwQ=', 'operator': 'encrypt'}
]
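A practical detail when choosing the key: Presidio's documentation describes the `encrypt` operator as AES-based and expecting a key of 128, 192, or 256 bits. Treat that as an assumption to verify against the version you have installed. The sketch below uses only the standard library to check the key from the cell above and to inspect one of the encrypted values; `key_bits` and `check_key` are hypothetical helpers, not Presidio functions.

```python
import base64

# Assumed requirement (verify against your Presidio version): AES key sizes in bits
VALID_KEY_BITS = (128, 192, 256)


def key_bits(key: str) -> int:
    """Size of the key in bits once encoded as UTF-8."""
    return len(key.encode("utf-8")) * 8


def check_key(key: str) -> None:
    """Raise ValueError if the key size is not one of the assumed AES sizes."""
    bits = key_bits(key)
    if bits not in VALID_KEY_BITS:
        raise ValueError(f"key is {bits} bits, expected one of {VALID_KEY_BITS}")


key = "WmZq4t7w!z%C&F)J"
check_key(key)  # 16 ASCII characters, so this passes

# The encrypted PERSON value from the output above is base64 text
ciphertext = "TfRw3vE9yGQGVHeBKR3gu1AaY9z9wHrn5ATCYXM8KwQ="
raw = base64.b64decode(ciphertext)
print(key_bits(key), len(ciphertext), len(raw))
```

The 44-character value decodes to a byte length that is a multiple of 16, which is what a 16-byte block cipher would produce. In a real project the key should come from a secret store rather than being hard-coded in the notebook.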



If needed, an encrypted value can be decrypted back to the original text with the same key:

In [43]:
from presidio_anonymizer import DeanonymizeEngine
from presidio_anonymizer.entities import OperatorResult, OperatorConfig

# Initialize the engine:
engine = DeanonymizeEngine()

# Invoke the deanonymize function with the text, anonymizer results and
# Operators to define the deanonymization type.
result = engine.deanonymize(
    text="My name is S184CMt9Drj7QaKQ21JTrpYzghnboTF9pn/neN8JME0=",
    entities=[
        OperatorResult(start=11, end=55, entity_type="PERSON"),
    ],
    operators={"DEFAULT": OperatorConfig("decrypt", {"key": "WmZq4t7w!z%C&F)J"})},
)

print(result)

text: My name is Chloë
items:
[
    {'start': 11, 'end': 16, 'entity_type': 'PERSON', 'text': 'Chloë', 'operator': 'decrypt'}
]



## 1.3 Customize Presidio
Next, we'll go over ways to customize Presidio for specific needs by adding PII recognizers, using context words, NER models, and more.
Presidio offers support for:
- Deny-list based PII recognition: For cases such as detecting titles in text, you can supply a predefined list of tokens (e.g., Sir, Mr., Mrs.) that should always be flagged as PII.
- Regular-expression based PII recognition: You can use regular expressions to pinpoint custom entity patterns, such as a company's user ID that begins with a specific prefix followed by a sequence of digits.
- Rule-based logic recognizer: You can develop a custom recognizer that implements your own rule-based logic.
- Additional models or services: Presidio supports adding new models and languages (for example, transformers from HuggingFace) and integrating external PII detection services or frameworks (such as Azure AI Language).

### 1.3.1 Deny-list based PII recognition
In this example, we will pass a short list of tokens that should be marked as PII when detected.

First, let's define the tokens we want to treat as PII. In this example, it is a list of titles:

In [2]:
titles_list = [
    "Sir",
    "Ma'am",
    "Madam",
    "Mr.",
    "Mrs.",
    "Ms.",
    "Miss",
    "Dr.",
    "Professor",
]

Second, let's create a `PatternRecognizer` that scans for those titles by passing it a `deny_list`:

In [5]:
from presidio_analyzer import PatternRecognizer
titles_recognizer = PatternRecognizer(supported_entity="TITLE", deny_list=titles_list)

# Call the new recognizer directly (no AnalyzerEngine needed yet)
text = "Hello, Mr. John Doe"
result = titles_recognizer.analyze(text, entities=["TITLE"])
print(result)

[type: TITLE, start: 7, end: 10, score: 1.0]
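To build intuition for what a deny-list recognizer does, the matching can be approximated with a single regular expression that accepts any listed term as long as it is not glued to other word characters. The following is a simplified standard-library sketch, not Presidio's actual implementation. `build_deny_list_pattern` and `find_denied` are hypothetical helpers, and the matching here is case-sensitive by choice.

```python
import re
from typing import List, Tuple


def build_deny_list_pattern(deny_list: List[str]) -> "re.Pattern":
    """Compile one regex that matches any deny-list term as a standalone token."""
    # Escape terms such as "Mr." so the dot is literal; try longer terms first
    alternatives = sorted((re.escape(term) for term in deny_list), key=len, reverse=True)
    return re.compile(r"(?:^|(?<=\W))(" + "|".join(alternatives) + r")(?:(?=\W)|$)")


def find_denied(text: str, deny_list: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, term) for every deny-list term found in text."""
    pattern = build_deny_list_pattern(deny_list)
    return [(m.start(1), m.end(1), m.group(1)) for m in pattern.finditer(text)]


titles = ["Sir", "Ma'am", "Madam", "Mr.", "Mrs.", "Ms.", "Miss", "Dr.", "Professor"]
matches = find_denied("Hello, Mr. John Doe", titles)
print(matches)
```

The lookarounds at both ends keep `Miss` from matching inside a longer word such as `Mississippi`, which is the main reason to prefer this over a plain substring search.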


To detect both the title and the person's name in this text, add `titles_recognizer` to the `AnalyzerEngine` registry:

In [20]:
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()
# Add titles_recognizer to the analyzer's recognizer registry
analyzer.registry.add_recognizer(titles_recognizer)
# Now test the analyzer with the new recognizer
text = "Hello, Mr. John Doe"
results = analyzer.analyze(text, language="en", entities=["TITLE", "PERSON"])
# print(results)
print("Identified the following PII:")
for result in results:
    print(f"- {text[result.start:result.end]} as {result.entity_type}")


Identified the following PII:
- Mr. as TITLE
- John Doe as PERSON


### 1.3.2 Regular-expressions based PII recognition
In this example, we'll show a simple way to add a recognizer based on a regular expression. Let's assume we want to be extremely conservative and treat any token that contains a number as PII.

In [22]:
from presidio_analyzer import Pattern, PatternRecognizer

# Define the regex pattern in a Presidio Pattern object
# (raw string so that \d is passed to the regex engine unchanged)
number_pattern = Pattern(name="numbers_pattern", regex=r"\d+", score=0.5)

# Define a PatternRecognizer with the number_pattern
number_recognizer = PatternRecognizer(supported_entity="NUMBER", patterns=[number_pattern])

# Test the recognizer
text = "My phone number is 555-1234"
results = number_recognizer.analyze(text, entities=["NUMBER"])
print("Identified the following PII:")
for result in results:
    print(f"- {text[result.start:result.end]} as {result.entity_type}")

Identified the following PII:
- 555 as NUMBER
- 1234 as NUMBER
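The output shows that `\d+` flags only the digit runs themselves, so `555-1234` is reported as two separate matches and the hyphen is left untouched. If the goal is to cover the whole token that contains a digit, a broader pattern is needed. The comparison below uses only Python's `re` module; the alternative pattern `\S*\d\S*` is an illustrative assumption, not something Presidio ships with.

```python
import re

text = "My phone number is 555-1234"

# Pattern used above: matches each run of digits on its own
digit_runs = [m.group() for m in re.finditer(r"\d+", text)]

# Illustrative alternative: any whitespace-delimited token containing a digit
digit_tokens = [m.group() for m in re.finditer(r"\S*\d\S*", text)]

print(digit_runs)
print(digit_tokens)
```

The broader pattern could be passed as the `regex` argument of a `Pattern` in the same way as `\d+`. It will also pick up adjacent punctuation, so it is worth testing on representative text first.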


Note that a newly added recognizer can introduce errors, both false positives and false negatives, which affect Presidio's overall performance. Consider testing each recognizer on a representative dataset.

### 1.3.3 Rule based logic recognizer
Taking the numbers recognizer above one step further, let's say we would also like to detect numbers written as words, for example "Number One". We can leverage the underlying spaCy token attributes, or write our own logic, to detect entities that a regular expression or deny-list cannot catch.

- In this example, we create a new class that extends EntityRecognizer, the base recognizer in Presidio. This abstract class requires us to implement the load method and the analyze method.
- Each custom recognizer accepts an object of type NlpArtifacts, which holds pre-computed attributes of the input text.

In [19]:
from typing import List
from presidio_analyzer import EntityRecognizer, RecognizerResult, AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpArtifacts

# New recognizer class
class MyNumbersRecognizer(EntityRecognizer):
    expected_confidence_level = 0.7
    def load(self) -> None:
        pass

    def analyze(
            self, text: str, entities: List[str], nlp_artifacts: NlpArtifacts
    ) -> List[RecognizerResult]:
        # Collect matches in a local list instead of relying on a notebook-level variable
        results = []
        # Iterate over the spaCy tokens and check token.like_num
        for token in nlp_artifacts.tokens:
            if token.like_num:
                result = RecognizerResult(
                        entity_type="NUMBER",
                        start=token.idx,
                        end=token.idx + len(token),
                        score=self.expected_confidence_level
                )
                results.append(result)
        return results

# Create an instance of MyNumbersRecognizer that supports the NUMBER entity
new_numbers_recognizer = MyNumbersRecognizer(supported_entities = ["NUMBER"])
# Create an instance of the analyzer engine
analyzer = AnalyzerEngine()
# Add the new number recognizer to the analyzer
analyzer.registry.add_recognizer(new_numbers_recognizer)

# Test the analyzer with the new recognizer
text = "My name is Harry and my phone number is Five Five Five One Two Three Four"
results = analyzer.analyze(text, language="en")
print("Identified the following PII:")
for result in results:
    print(f"- {text[result.start:result.end]} as {result.entity_type}")

Identified the following PII:
- Harry as PERSON
- Five as NUMBER
- Five as NUMBER
- Five as NUMBER
- One as NUMBER
- Two as NUMBER
- Three as NUMBER
- Four as NUMBER
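The rule inside `analyze` boils down to "walk over the tokens, test each one, and record its character span". The sketch below shows that logic in isolation, without spaCy or Presidio. The whitespace tokenizer and the small `NUMBER_WORDS` set are hypothetical stand-ins for spaCy's tokenizer and `like_num`, which cover far more cases.

```python
import re
from typing import List, Tuple

# Hypothetical, deliberately small vocabulary standing in for spaCy's like_num
NUMBER_WORDS = {
    "zero", "one", "two", "three", "four", "five",
    "six", "seven", "eight", "nine", "ten",
}


def looks_like_number(token: str) -> bool:
    """True for digit strings and for the number words listed above."""
    return token.isdigit() or token.lower() in NUMBER_WORDS


def find_number_tokens(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, token) for each whitespace-delimited number-like token."""
    spans = []
    for match in re.finditer(r"\S+", text):
        if looks_like_number(match.group()):
            spans.append((match.start(), match.end(), match.group()))
    return spans


text = "My name is Harry and my phone number is Five Five Five One Two Three Four"
spans = find_number_tokens(text)
for start, end, token in spans:
    print(f"- {text[start:end]} as NUMBER")
```

Each `(start, end)` pair plays the same role as `token.idx` and `token.idx + len(token)` in the recognizer class above.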
