In [1]:
from presidio_analyzer import AnalyzerEngine, PatternRecognizer
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import AnonymizerConfig

# Analyze Text for PII Entities

Use Presidio Analyzer to identify PII entities in a text.

Presidio Analyzer uses pre-defined entity recognizers and also offers the option to create custom recognizers.


The following code sample will:

 - Set up the analyzer engine: load the NLP module (a spaCy model by default) and the other PII recognizers
 - Call the analyzer to get results for the "PHONE_NUMBER" entity type


In [2]:
text_to_anonymize = "His name is Mr. Jones and his phone number is 212-555-5555"

In [3]:
analyzer = AnalyzerEngine()
analyzer_results = analyzer.analyze(text=text_to_anonymize, 
                                    entities=["PHONE_NUMBER"], 
                                    language='en')

print(analyzer_results)

[type: PHONE_NUMBER, start: 46, end: 58, score: 0.85]
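The `start` and `end` values in the result above behave like Python slice offsets into the analyzed text, with `end` exclusive. The sketch below illustrates this with plain Python only. `Span` is a hypothetical stand-in for the analyzer's result object, not a Presidio class, and the values are copied from the output above.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for one analyzer result (not a Presidio class)."""
    entity_type: str
    start: int
    end: int
    score: float


text = "His name is Mr. Jones and his phone number is 212-555-5555"

# Values copied from the analyzer output above.
phone = Span("PHONE_NUMBER", 46, 58, 0.85)

# start is inclusive and end is exclusive, so an ordinary slice
# recovers the matched substring.
matched = text[phone.start:phone.end]
print(matched)  # 212-555-5555
```

With real analyzer results, the same slicing should work on each result's `start` and `end` attributes.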


# Create Custom PII Entity Recognizers

Presidio Analyzer comes with a pre-defined set of entity recognizers. It also allows adding new recognizers without changing the analyzer base code, **by creating custom recognizers.**

In the following example, we will create two new recognizers of type `PatternRecognizer` to identify titles and pronouns in the analyzed text.
A `PatternRecognizer` is a PII entity recognizer which uses regular expressions or deny-lists.

The following code sample will:

 - Create custom recognizers
 - Add the new custom recognizers to the analyzer
 - Call analyzer to get results from the new recognizers

In [4]:
# Create new recognizers

titles_recognizer = PatternRecognizer(supported_entity="TITLE",
                                      deny_list=["Mr.","Mrs.","Miss"])

pronoun_recognizer = PatternRecognizer(supported_entity="PRONOUN",
                                       deny_list=["he", "He", "his", "His", "she", "She", "hers", "Hers"])

# Add to recognizer registry
analyzer.registry.add_recognizer(titles_recognizer)
analyzer.registry.add_recognizer(pronoun_recognizer)

# Run analyzer with the TITLE and PRONOUN entities only
analyzer_results = analyzer.analyze(text=text_to_anonymize,
                            entities=["TITLE", "PRONOUN"],
                            language="en")

analyzer_results

[type: PRONOUN, start: 0, end: 3, score: 1.0,
 type: TITLE, start: 12, end: 15, score: 1.0,
 type: PRONOUN, start: 26, end: 29, score: 1.0]
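To make the spans above easier to interpret, here is a rough sketch of how deny-list matching can be done with a single regular expression. This is an illustration in plain Python, not Presidio's actual implementation. The helper name `deny_list_spans` and the exact word-boundary rules are assumptions made for this example.

```python
import re


def deny_list_spans(text, entity_type, deny_list):
    """Return (entity_type, start, end) for each whole-token deny-list hit.

    Hypothetical helper for illustration only.
    """
    # Put longer terms first so "hers" is tried before "he".
    terms = sorted(deny_list, key=len, reverse=True)
    # Escape each term so that the "." in "Mr." is matched literally.
    alternation = "|".join(re.escape(term) for term in terms)
    # The lookarounds stop "he" from matching inside a word such as "the".
    pattern = re.compile(rf"(?<!\w)(?:{alternation})(?!\w)")
    return [(entity_type, m.start(), m.end()) for m in pattern.finditer(text)]


text = "His name is Mr. Jones and his phone number is 212-555-5555"

titles = deny_list_spans(text, "TITLE", ["Mr.", "Mrs.", "Miss"])
pronouns = deny_list_spans(
    text, "PRONOUN", ["he", "He", "his", "His", "she", "She", "hers", "Hers"]
)

print(sorted(titles + pronouns, key=lambda span: span[1]))
# [('PRONOUN', 0, 3), ('TITLE', 12, 15), ('PRONOUN', 26, 29)]
```

The sketch is case-sensitive, which is why the deny-list carries both "he" and "He". It produces the same three spans as the analyzer output above for this sample sentence.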

Call Presidio Analyzer without an `entities` filter to get results from all configured recognizers, both the defaults and the new custom ones.

In [5]:
analyzer_results = analyzer.analyze(text=text_to_anonymize, language='en')

analyzer_results

[type: PRONOUN, start: 0, end: 3, score: 1.0,
 type: TITLE, start: 12, end: 15, score: 1.0,
 type: PRONOUN, start: 26, end: 29, score: 1.0,
 type: PERSON, start: 16, end: 21, score: 0.85,
 type: PHONE_NUMBER, start: 46, end: 58, score: 0.85]

# Anonymize Text with Identified PII Entities

Presidio Anonymizer iterates over the Presidio Analyzer results and anonymizes the identified text spans.
It provides five types of anonymizers: replace, redact, mask, hash, and encrypt. The default is **replace**.

The following code sample will:

 - Set up the anonymizer engine
 - Create an anonymizer request: the text to anonymize, the anonymizers to apply, and the results from the analyzer request
 - Anonymize the text


In [6]:
anonymizer = AnonymizerEngine()

anonymized_results = anonymizer.anonymize(
    text=text_to_anonymize,
    analyzer_results=analyzer_results,    
    anonymizers_config={"DEFAULT": AnonymizerConfig("replace"), 
                        "PHONE_NUMBER": AnonymizerConfig("mask", {"type": "mask", 
                                                                  "masking_char" : "*", 
                                                                  "chars_to_mask" : 8, 
                                                                  "from_end" : False}),
                        "TITLE": AnonymizerConfig("redact", {})}
)

anonymized_results

Text: <PRONOUN> name is  <PERSON> and <PRONOUN> phone number is ********5555

Anonymized entities:
text: ********5555, anonymizer: mask, entity_type: PHONE_NUMBER, indices: (58, 70).
text: <PRONOUN>, anonymizer: replace, entity_type: PRONOUN, indices: (32, 41).
text: <PERSON>, anonymizer: replace, entity_type: PERSON, indices: (19, 27).
text: , anonymizer: redact, entity_type: TITLE, indices: (18, 18).
text: <PRONOUN>, anonymizer: replace, entity_type: PRONOUN, indices: (0, 9).
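The output above can be reproduced conceptually with a few lines of plain Python. The sketch below is not Presidio's implementation. It assumes non-overlapping spans, and the helper names `mask` and `anonymize` are hypothetical. It applies the replacements from right to left so that the offsets of earlier spans stay valid while the text changes length.

```python
def mask(value, masking_char="*", chars_to_mask=8, from_end=False):
    """Overwrite up to chars_to_mask characters, from the start or the end."""
    n = min(chars_to_mask, len(value))
    if from_end:
        return value[:len(value) - n] + masking_char * n
    return masking_char * n + value[n:]


def anonymize(text, spans, operators):
    """Apply one operator per span; assumes the spans do not overlap."""
    # Process the rightmost span first so earlier offsets are not shifted.
    for entity_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        operator = operators.get(entity_type, operators["DEFAULT"])
        text = text[:start] + operator(entity_type, text[start:end]) + text[end:]
    return text


text = "His name is Mr. Jones and his phone number is 212-555-5555"

# Spans copied from the analyzer output shown earlier.
spans = [
    ("PRONOUN", 0, 3),
    ("TITLE", 12, 15),
    ("PRONOUN", 26, 29),
    ("PERSON", 16, 21),
    ("PHONE_NUMBER", 46, 58),
]

operators = {
    "DEFAULT": lambda entity_type, value: f"<{entity_type}>",  # replace
    "PHONE_NUMBER": lambda entity_type, value: mask(value),    # mask first 8 chars
    "TITLE": lambda entity_type, value: "",                    # redact
}

result = anonymize(text, spans, operators)
print(result)
# <PRONOUN> name is  <PERSON> and <PRONOUN> phone number is ********5555
```

The double space after "is" is what remains once the title is redacted to an empty string, matching the anonymized text above.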