# Sort Code Regex Matching with Context Awareness

Presidio has an internal mechanism for leveraging context words. This mechanism increases the detection confidence of a PII entity when a specific word appears before or after it.

In this example we first implement a sort code recognizer without context, and then add context to see how the confidence changes. A bare sort code regex pattern (essentially six digits) is very weak, so we want the initial confidence to be low and to increase when context words are present.
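To see why a bare digit pattern deserves a low score, the two regexes used by this notebook's `Pattern` objects can be tried with Python's standard `re` module, independent of Presidio. This is only a sketch: the sample strings and the `matches` helper are made up for illustration.

```python
import re

# Same regexes as the two Pattern objects defined in this notebook
FULL_SORT_CODE = re.compile(r"\b\d{2}[-\s]\d{2}[-\s]\d{2}\b")  # e.g. 90-87-86
WEAK_SORT_CODE = re.compile(r"\b\d{6}\b")  # any run of exactly six digits

# Illustrative inputs (not taken from the original notebook)
samples = [
    "My sort code is 90-87-86",
    "My sort code is 908786",
    "Order 123456 has shipped",
]


def matches(text: str) -> dict:
    """Report which of the two patterns fire on the given text."""
    return {
        "full": bool(FULL_SORT_CODE.search(text)),
        "weak": bool(WEAK_SORT_CODE.search(text)),
    }


for sample in samples:
    print(sample, "->", matches(sample))
```

The weak pattern also fires on the order number, which has nothing to do with banking. That is the kind of false positive a low base score plus context words is meant to handle.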

In [1]:
from typing import List
import pprint

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, EntityRecognizer, Pattern, RecognizerResult
from presidio_analyzer.recognizer_registry import RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngine, SpacyNlpEngine, NlpArtifacts
from presidio_analyzer.context_aware_enhancers import LemmaContextAwareEnhancer

In [2]:
sortcode_pattern_full = Pattern(name="Sort Code Perfect", regex=r"\b\d{2}[-\s]\d{2}[-\s]\d{2}\b", score=0.5)
sortcode_pattern = Pattern(name="Sort Code (weak)", regex=r"\b\d{6}\b", score=0.01)

# Define the recognizer with the defined pattern
sortcode_recognizer = PatternRecognizer(supported_entity="SORTCODE", patterns = [sortcode_pattern, sortcode_pattern_full])

registry = RecognizerRegistry()
registry.add_recognizer(sortcode_recognizer)
analyzer = AnalyzerEngine(registry=registry)

# Test
results = analyzer.analyze(text="My sort code is 908786",language="en")
print(f"Result:\n {results}")

Result:
 [type: SORTCODE, start: 16, end: 22, score: 0.01]


So this is working, but the weak pattern would catch any six-digit string. This is why we set its score to 0.01. Let's use context words to increase the score:

In [30]:
# Define the recognizer with the defined pattern and context words
sortcode_pattern_full = Pattern(name="Sort Code Perfect", regex=r"\b\d{2}[-\s]\d{2}[-\s]\d{2}\b", score=0.5) # Standard pattern for sort code
sortcode_pattern = Pattern(name="Sort Code (weak)", regex=r"\b\d{6}\b", score=0.01) # Sequence of 6 digits, need context to confirm if sort code

sortcode_recognizer = PatternRecognizer(supported_entity="SORTCODE", 
                                       patterns = [sortcode_pattern, sortcode_pattern_full],
                                       context = [r"sort code", "sortcode", "sort"]) # Score only increased when we added 'sort' to context words

When creating an AnalyzerEngine we can provide our own context enhancement logic by passing it to the context_aware_enhancer parameter. If none is passed, AnalyzerEngine creates a LemmaContextAwareEnhancer by default, which enhances the score of each matched result if its recognizer holds context words and those words are found in the context of the matched entity.
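As a rough mental model of that lookup, the sketch below searches a window of word tokens around the match for any of the recognizer's context words. It is a simplification, not Presidio's implementation: the real enhancer compares lemmas produced by the NLP engine, whereas this version uses lowercase word tokens. The helper name `find_supportive_word` and the window sizes (five tokens before, none after) are assumptions made for illustration.

```python
import re
from typing import List, Optional


def find_supportive_word(
    text: str,
    start: int,
    end: int,
    context_words: List[str],
    prefix_count: int = 5,  # assumed window size before the match
    suffix_count: int = 0,  # assumed window size after the match
) -> Optional[str]:
    """Return the first context word found among the tokens around text[start:end]."""
    before = re.findall(r"\w+", text[:start].lower())
    after = re.findall(r"\w+", text[end:].lower())

    # Guard against before[-0:], which would return the whole list
    window = before[-prefix_count:] if prefix_count else []
    window += after[:suffix_count]

    for word in context_words:
        # Each context word is compared against single tokens only
        if word.lower() in window:
            return word
    return None


text = "My sort code is 902107"
print(find_supportive_word(text, 16, 22, ["sort code", "sortcode", "sort"]))
print(find_supportive_word(text, 16, 22, ["sort code", "sortcode"]))
```

Because the comparison is made token by token, a multi-word entry such as "sort code" can never equal a single token in this model. That is consistent with the observation in the recognizer cell that the score only increased once 'sort' was added.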

In [31]:
registry = RecognizerRegistry()
registry.add_recognizer(sortcode_recognizer)
analyzer = AnalyzerEngine(registry=registry)

# Test
results = analyzer.analyze(text="My sort code is 902107",language="en")
print("Result:")
print(results)

Result:
[type: SORTCODE, start: 16, end: 22, score: 0.4]


The confidence score is now 0.4 instead of 0.01, because the default context similarity factor of LemmaContextAwareEnhancer is 0.35 and the default minimum score with context similarity is 0.4. We can change these defaults by passing the context_similarity_factor and min_score_with_context_similarity parameters to LemmaContextAwareEnhancer, for example:

In [32]:
registry = RecognizerRegistry()
registry.add_recognizer(sortcode_recognizer)
analyzer = AnalyzerEngine(
    registry=registry,
    context_aware_enhancer=
        LemmaContextAwareEnhancer(context_similarity_factor=0.45, min_score_with_context_similarity=0.4))

# Test
results = analyzer.analyze(text="My sort code is 902103",language="en")
print("Result:")
print(results)

Result:
[type: SORTCODE, start: 16, end: 22, score: 0.46]


The confidence score is now 0.46 because it was enhanced from 0.01 by 0.45, which is more than the minimum of 0.4.
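The arithmetic behind the two scores above (0.4 with the defaults, 0.46 with a factor of 0.45) can be written out as a small function. This is a sketch that reproduces the numbers seen in this notebook, not Presidio's source; the name `enhance_score` is hypothetical, and the cap at 1.0 is an assumption.

```python
def enhance_score(
    original: float,
    similarity_factor: float = 0.35,
    min_score: float = 0.4,
    max_score: float = 1.0,  # assumed upper bound for any score
) -> float:
    """Boost a score when a context word was found near the match."""
    boosted = original + similarity_factor  # add the similarity factor
    boosted = max(boosted, min_score)       # never end up below the minimum
    return min(boosted, max_score)          # never exceed the cap


print(round(enhance_score(0.01), 2))                          # 0.4  (0.36 raised to the minimum)
print(round(enhance_score(0.01, similarity_factor=0.45), 2))  # 0.46 (already above the minimum)
```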

In [33]:
results = analyzer.analyze(text="My sort code is 902103",language="en", return_decision_process = True)
decision_process = results[0].analysis_explanation

pp = pprint.PrettyPrinter()
print("Decision process output:\n")
pp.pprint(decision_process.__dict__)

Decision process output:

{'original_score': 0.01,
 'pattern': '\\b\\d{6}\\b',
 'pattern_name': 'Sort Code (weak)',
 'recognizer': 'PatternRecognizer',
 'regex_flags': regex.I|M|S,
 'score': 0.46,
 'score_context_improvement': 0.45,
 'supportive_context_word': 'sort',
 'textual_explanation': None,
 'validation_result': None}


## Outer context

Presidio supports passing a list of outer context words at the analyzer level. This is useful if the text comes from a specific column, a specific user input, etc. The example below uses a zip code recognizer: notice how the "zip" context word doesn't appear in the text but still enhances the confidence score from 0.01 to 0.4:

In [None]:
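# Weak zip code pattern (it was not defined earlier in this notebook; regex and score are assumed)
zipcode_pattern = Pattern(name="zip code (weak)", regex=r"(\b\d{5}(?:\-\d{4})?\b)", score=0.01)

# Define the recognizer with the defined pattern and context words
zipcode_recognizer = PatternRecognizer(supported_entity="US_ZIP_CODE",
                                       patterns = [zipcode_pattern],
                                       context= ["zip","zipcode"])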

registry = RecognizerRegistry()
registry.add_recognizer(zipcode_recognizer)
analyzer = AnalyzerEngine(registry=registry)

# Test
result = analyzer.analyze(text="My code is 90210",language="en", context=["zip"])
print("Result:")
print(result)
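For the column use case mentioned above, one way to obtain outer context words is to derive them from the column name. The helper below, `column_to_context`, is hypothetical and not part of Presidio, and the `rows` data is made up. The sketch only prepares the word list; the call that would consume it is shown as a comment.

```python
import re
from typing import List


def column_to_context(column_name: str) -> List[str]:
    """Split a column name such as 'customer_zip' into lowercase context words."""
    # Treat every run of letters as one word; underscores, digits and dots act as separators
    return [word.lower() for word in re.findall(r"[A-Za-z]+", column_name)]


# Illustrative data: the same text stored under two differently named columns
rows = {
    "customer_zip": "My code is 90210",
    "notes": "My code is 90210",
}

for column, text in rows.items():
    context = column_to_context(column)
    print(column, "->", context)
    # With Presidio installed, the list would be passed as outer context:
    # analyzer.analyze(text=text, language="en", context=context)
```

Only the first column yields "zip", so only that value would receive the context boost from the zip code recognizer.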

# Full Sortcode Script

In [66]:
from typing import List
import pprint

from presidio_analyzer import AnalyzerEngine, PatternRecognizer, EntityRecognizer, Pattern, RecognizerResult
from presidio_analyzer.recognizer_registry import RecognizerRegistry
from presidio_analyzer.nlp_engine import NlpEngine, SpacyNlpEngine, NlpArtifacts
from presidio_analyzer.context_aware_enhancers import LemmaContextAwareEnhancer

# Define the recognizer with the defined pattern and context words 
# Creating 2 patterns - 1 for a perfect match, 1 for badly formed entries that require some context
# May need to modify the weak one -> think about what cases we want captured by this one
sortcode_pattern_full = Pattern(name="Sort Code Perfect", regex=r"\b\d{2}[-\s]\d{2}[-\s]\d{2}\b", score=1.0) # Standard pattern for sort code
sortcode_pattern = Pattern(name="Sort Code (weak)", regex=r"\b\d{6}\b", score=0.01) # Sequence of 6 digits, need context to confirm if sort code

# Score only increases once 'sort' is among the context words - multi-word phrases such as "sort code" do not seem to match
sortcode_recognizer = PatternRecognizer(supported_entity="SORTCODE", 
                                       patterns = [sortcode_pattern, sortcode_pattern_full],
                                       context = ["sortcode", "sort"])
registry = RecognizerRegistry()
registry.load_predefined_recognizers()
registry.add_recognizer(sortcode_recognizer)

context_aware_enhancer = LemmaContextAwareEnhancer(context_similarity_factor=0.45, min_score_with_context_similarity=0.4)

analyzer = AnalyzerEngine(registry=registry, context_aware_enhancer=context_aware_enhancer)

# Test
results = analyzer.analyze(text="My sort code is 90-21-03. Second sort code 987654",language="en")
print("Result:")
print(results)

Result:
[type: SORTCODE, start: 16, end: 24, score: 1.0, type: DATE_TIME, start: 16, end: 24, score: 0.85, type: SORTCODE, start: 43, end: 49, score: 0.46, type: US_DRIVER_LICENSE, start: 43, end: 49, score: 0.01]


In [67]:
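results = analyzer.analyze(text="My sort code is 90-21-03. Second sort code 987654", language="en", return_decision_process=True)
# Keep the explanation of the weak six-digit SORTCODE match (start offset 43) for the next cell
decision_process = next(
    r for r in results if r.entity_type == "SORTCODE" and r.start == 43
).analysis_explanation
results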

[type: SORTCODE, start: 16, end: 24, score: 1.0,
 type: DATE_TIME, start: 16, end: 24, score: 0.85,
 type: SORTCODE, start: 43, end: 49, score: 0.46,
 type: US_DRIVER_LICENSE, start: 43, end: 49, score: 0.01]

In [56]:
pp = pprint.PrettyPrinter()
print("Decision process output:\n")
pp.pprint(decision_process.__dict__)

Decision process output:

{'original_score': 0.01,
 'pattern': '\\b\\d{6}\\b',
 'pattern_name': 'Sort Code (weak)',
 'recognizer': 'PatternRecognizer',
 'regex_flags': regex.I|M|S,
 'score': 0.46,
 'score_context_improvement': 0.45,
 'supportive_context_word': 'sort',
 'textual_explanation': None,
 'validation_result': None}


In [60]:
from presidio_anonymizer import AnonymizerEngine, DeanonymizeEngine, OperatorConfig
from presidio_anonymizer.operators import Operator, OperatorType
from typing import Dict

In [61]:
class InstanceCounterAnonymizer(Operator):
    """
    Anonymizer which replaces the entity value
    with an instance counter per entity.
    """

    REPLACING_FORMAT = "<{entity_type}_{index}>"

    def operate(self, text: str, params: Dict = None) -> str:
        """Anonymize the input text."""

        entity_type: str = params["entity_type"]

        # entity_mapping is a dict of dicts containing mappings per entity type
        entity_mapping: Dict[str, Dict[str, str]] = params["entity_mapping"]

        entity_mapping_for_type = entity_mapping.get(entity_type)
        if not entity_mapping_for_type:
            new_text = self.REPLACING_FORMAT.format(
                entity_type=entity_type, index=0
            )
            entity_mapping[entity_type] = {}

        else:
            if text in entity_mapping_for_type:
                return entity_mapping_for_type[text]

            previous_index = self._get_last_index(entity_mapping_for_type)
            new_text = self.REPLACING_FORMAT.format(
                entity_type=entity_type, index=previous_index + 1
            )

        entity_mapping[entity_type][text] = new_text
        return new_text

    @staticmethod
    def _get_last_index(entity_mapping_for_type: Dict) -> int:
        """Get the last index for a given entity type."""

        def get_index(value: str) -> int:
            return int(value.split("_")[-1][:-1])

        indices = [get_index(v) for v in entity_mapping_for_type.values()]
        return max(indices)

    def validate(self, params: Dict = None) -> None:
        """Validate operator parameters."""

        if "entity_mapping" not in params:
            raise ValueError("An input Dict called `entity_mapping` is required.")
        if "entity_type" not in params:
            raise ValueError("An entity_type param is required.")

    def operator_name(self) -> str:
        return "entity_counter"

    def operator_type(self) -> OperatorType:
        return OperatorType.Anonymize

In [65]:
# Create Anonymizer engine and add the custom anonymizer
anonymizer_engine = AnonymizerEngine()
anonymizer_engine.add_anonymizer(InstanceCounterAnonymizer)

# Create a mapping between entity types and counters
entity_mapping = dict()

text = "My sort code is 90-21-03. Second sort code 987654"
analyzer_results = analyzer.analyze(text=text, language="en")

# Anonymize the text
anonymized_result = anonymizer_engine.anonymize(
    text,
    analyzer_results,
    {
        "DEFAULT": OperatorConfig(
            "entity_counter", {"entity_mapping": entity_mapping}
        )
    },
)

print(anonymized_result.text)

My sort code is <SORTCODE_0>. the second sort code is 908783.


In [None]:
class InstanceCounterDeanonymizer(Operator):
    """
    Deanonymizer which replaces the unique identifier 
    with the original text.
    """

    def operate(self, text: str, params: Dict = None) -> str:
        """Deanonymize the input text."""

        entity_type: str = params["entity_type"]

        # entity_mapping is a dict of dicts containing mappings per entity type
        entity_mapping: Dict[str, Dict[str, str]] = params["entity_mapping"]

        if entity_type not in entity_mapping:
            raise ValueError(f"Entity type {entity_type} not found in entity mapping!")
        if text not in entity_mapping[entity_type].values():
            raise ValueError(f"Text {text} not found in entity mapping for entity type {entity_type}!")

        return self._find_key_by_value(entity_mapping[entity_type], text)

    @staticmethod
    def _find_key_by_value(entity_mapping, value):
        for key, val in entity_mapping.items():
            if val == value:
                return key
        return None
    
    def validate(self, params: Dict = None) -> None:
        """Validate operator parameters."""

        if "entity_mapping" not in params:
            raise ValueError("An input Dict called `entity_mapping` is required.")
        if "entity_type" not in params:
            raise ValueError("An entity_type param is required.")

    def operator_name(self) -> str:
        return "entity_counter_deanonymizer"

    def operator_type(self) -> OperatorType:
        return OperatorType.Deanonymize

deanonymizer_engine = DeanonymizeEngine()
deanonymizer_engine.add_deanonymizer(InstanceCounterDeanonymizer)

deanonymized = deanonymizer_engine.deanonymize(
    anonymized_result.text, 
    anonymized_result.items, 
    {"DEFAULT": OperatorConfig("entity_counter_deanonymizer", 
                               params={"entity_mapping": entity_mapping})}
)

In [None]:
# `pprint` was imported as a module, so call its pprint function
print("anonymized text:")
pprint.pprint(anonymized_result.text)
print("de-anonymized text:")
pprint.pprint(deanonymized.text)
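The two operators above communicate only through the shared `entity_mapping` dict. The sketch below traces that bookkeeping with plain dictionaries and no Presidio. The function names `assign_placeholder` and `restore_value` are hypothetical, and the index is taken from the number of stored entries rather than from parsing the last placeholder, which gives the same result as long as no entries are removed.

```python
from typing import Dict

Mapping = Dict[str, Dict[str, str]]  # entity type -> {original value: placeholder}


def assign_placeholder(mapping: Mapping, entity_type: str, value: str) -> str:
    """Return the placeholder for a value, creating a new numbered one if needed."""
    per_type = mapping.setdefault(entity_type, {})
    if value in per_type:
        return per_type[value]  # repeated values reuse their placeholder
    placeholder = f"<{entity_type}_{len(per_type)}>"
    per_type[value] = placeholder
    return placeholder


def restore_value(mapping: Mapping, entity_type: str, placeholder: str) -> str:
    """Reverse lookup: find the original value behind a placeholder."""
    for value, known_placeholder in mapping[entity_type].items():
        if known_placeholder == placeholder:
            return value
    raise ValueError(f"{placeholder} not found for entity type {entity_type}")


entity_mapping: Mapping = {}
print(assign_placeholder(entity_mapping, "SORTCODE", "90-21-03"))
print(assign_placeholder(entity_mapping, "SORTCODE", "987654"))
print(assign_placeholder(entity_mapping, "SORTCODE", "90-21-03"))
print(restore_value(entity_mapping, "SORTCODE", "<SORTCODE_1>"))
print(entity_mapping)
```

Keeping the mapping outside the operators is what makes the round trip possible: the deanonymizer needs exactly the dict that the anonymizer filled in.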