Added "keep" anonymizer (#1062)
paulo-raca committed Apr 30, 2023
1 parent 61a5405 commit 24a76a8
Showing 10 changed files with 324 additions and 7 deletions.
5 changes: 4 additions & 1 deletion CHANGELOG.md
@@ -3,6 +3,9 @@
All notable changes to this project will be documented in this file.

## [Unreleased]
### Added
#### Anonymizer
* Added `keep`, a no-op anonymizer that preserves some types of PII while keeping track of their positions in the anonymized output.

## [2.2.32] - 25.01.2023
### Changed
@@ -238,4 +241,4 @@ New endpoint for deanonymizing encrypted entities by the anonymizer.
[2.2.24]: https://github.com/microsoft/presidio/compare/2.2.23...2.2.24
[2.2.23]: https://github.com/microsoft/presidio/compare/2.2.2...2.2.23
[2.2.2]: https://github.com/microsoft/presidio/compare/2.2.1...2.2.2
-[2.2.1]: https://github.com/microsoft/presidio/compare/2.2.0...2.2.1
+[2.2.1]: https://github.com/microsoft/presidio/compare/2.2.0...2.2.1
3 changes: 2 additions & 1 deletion docs/anonymizer/index.md
@@ -190,14 +190,15 @@ with some other value by applying a certain operator (e.g. replace, mask, redact

## Built-in operators

-Operator type | Operator name | Description | Parameters |
+| Operator type | Operator name | Description | Parameters |
 | --- | --- | --- | --- |
 | Anonymize | replace | Replace the PII with desired value | `new_value`: replaces existing text with the given value.<br> If `new_value` is not supplied or empty, default behavior will be: <entity_type\> e.g: <PHONE_NUMBER\> |
 | Anonymize | redact | Remove the PII completely from text | None |
 | Anonymize | hash | Hashes the PII text | `hash_type`: sets the type of hashing. Can be either `sha256`, `sha512` or `md5`. <br> The default hash type is `sha256`. |
 | Anonymize | mask | Replace the PII with a given character | `chars_to_mask`: the number of characters of the PII to replace. <br> `masking_char`: the character to replace them with. <br> `from_end`: whether to mask the PII from its end. |
 | Anonymize | encrypt | Encrypt the PII using a given key | `key`: a cryptographic key used for the encryption. |
 | Anonymize | custom | Replace the PII with the result of the function executed on the PII | `lambda`: lambda to execute on the PII data. The lambda return type must be a string. |
+| Anonymize | keep | Preserve the PII unmodified | None |
 | Deanonymize | decrypt | Decrypt the encrypted PII in the text using the encryption key | `key`: a cryptographic key used for the encryption is also used for the decryption. |
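
To make the parameter descriptions in the table concrete, here is a minimal, self-contained sketch of what several of the anonymize operators do to a single PII string. These are plain-Python stand-ins written for illustration, not Presidio's implementations: the function names, the argument defaults, and the exact behavior of `from_end` (masking the last `chars_to_mask` characters) are assumptions.

```python
import hashlib


def replace(text: str, entity_type: str, new_value: str = "") -> str:
    """Return new_value, or "<ENTITY_TYPE>" when new_value is missing or empty."""
    return new_value if new_value else f"<{entity_type}>"


def redact(text: str) -> str:
    """Remove the PII completely."""
    return ""


def hash_pii(text: str, hash_type: str = "sha256") -> str:
    """Hash the PII with one of the hash types listed in the table."""
    if hash_type not in ("sha256", "sha512", "md5"):
        raise ValueError(f"unsupported hash_type: {hash_type}")
    return hashlib.new(hash_type, text.encode("utf-8")).hexdigest()


def mask(text: str, masking_char: str, chars_to_mask: int, from_end: bool = False) -> str:
    """Replace chars_to_mask characters, counted from the start or from the end."""
    n = max(0, min(chars_to_mask, len(text)))
    if n == 0:
        return text
    if from_end:
        return text[:-n] + masking_char * n
    return masking_char * n + text[n:]


def keep(text: str) -> str:
    """No-op: the PII is returned unmodified."""
    return text


if __name__ == "__main__":
    pii = "034453334"
    print(replace(pii, "PHONE_NUMBER"))
    print(repr(redact(pii)))
    print(mask(pii, "*", 4, from_end=True))
    print(keep(pii))
```

Seen this way, `keep` is the identity function among the operators: it changes nothing in the text, and its value comes from the engine still reporting the entity in its result items.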

!!! note "Note"
127 changes: 127 additions & 0 deletions docs/samples/python/keep_entities.ipynb
@@ -0,0 +1,127 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "gothic-trademark",
"metadata": {},
"source": [
"# Keeping some PIIs from being anonymized\n",
"\n",
"This sample shows how to use Presidio's `keep` anonymizer to keep some of the identified PIIs in the output string"
]
},
{
"cell_type": "markdown",
"id": "roman-allergy",
"metadata": {},
"source": [
"### Set up imports"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "extensive-greensboro",
"metadata": {},
"outputs": [],
"source": [
"from presidio_anonymizer import AnonymizerEngine\n",
"from presidio_anonymizer.entities import RecognizerResult, OperatorConfig"
]
},
{
"cell_type": "markdown",
"id": "metropolitan-atlantic",
"metadata": {},
"source": [
"### Presidio Anonymizer: Keep person names\n",
"\n",
"This example input has 2 PIIs, an person name and a location. We configure the anonymizer to replace the location name with a placeholder, but keep the person name unmodified."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "medium-ridge",
"metadata": {},
"outputs": [],
"source": [
"engine = AnonymizerEngine()\n",
"\n",
"# Invoke the anonymize function with the text,\n",
"# analyzer results (potentially coming from presidio-analyzer)\n",
"# and 'keep' operator on <PERSON> PIIs\n",
"anonymize_result = engine.anonymize(\n",
" text=\"My name is James Bond, I live in London\",\n",
" analyzer_results=[\n",
" RecognizerResult(entity_type=\"PERSON\", start=11, end=21, score=0.8),\n",
" RecognizerResult(entity_type=\"LOCATION\", start=33, end=39, score=0.8),\n",
" ],\n",
" operators={\n",
" \"PERSON\": OperatorConfig(\"keep\"),\n",
" \"DEFAULT\": OperatorConfig(\"replace\"),\n",
" },\n",
")"
]
},
{
"cell_type": "markdown",
"id": "1d2cabaa-4aa6-49cf-875d-4bdf407215b4",
"metadata": {},
"source": [
"### Result: Name unmodified, but tracked\n",
"\n",
"The person name is preserved in the result text, but remains tracked in the items list."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "421c2914-9b75-4c33-a270-e410d91d036b",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"text: My name is James Bond, I live in <LOCATION>\n",
"items:\n",
"[\n",
" {'start': 33, 'end': 43, 'entity_type': 'LOCATION', 'text': '<LOCATION>', 'operator': 'replace'},\n",
" {'start': 11, 'end': 21, 'entity_type': 'PERSON', 'text': 'James Bond', 'operator': 'keep'}\n",
"]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"anonymize_result"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
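
The notebook output shows that item offsets refer to positions in the anonymized text, not the input: `LOCATION` ends at 43 rather than 39 because `<LOCATION>` is longer than `London`. The sketch below reproduces that bookkeeping in plain Python. It is a simplified stand-in and not the engine's actual algorithm: `Span`, `apply_operators`, the string-valued operator map, and the assumption of non-overlapping spans are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Span:
    entity_type: str
    start: int  # offsets into the input text
    end: int


def apply_operators(
    text: str, spans: List[Span], operators: Dict[str, str]
) -> Tuple[str, List[dict]]:
    """Apply 'keep' or 'replace' per entity type and track output offsets.

    Assumes the spans do not overlap. "DEFAULT" is the fallback operator.
    """
    pieces: List[str] = []
    items: List[dict] = []
    cursor = 0   # how far we have consumed the input text
    out_len = 0  # length of the output built so far
    for span in sorted(spans, key=lambda s: s.start):
        gap = text[cursor:span.start]  # untouched text between entities
        pieces.append(gap)
        out_len += len(gap)
        op = operators.get(span.entity_type, operators.get("DEFAULT", "replace"))
        original = text[span.start:span.end]
        new_text = original if op == "keep" else f"<{span.entity_type}>"
        items.append(
            {
                "start": out_len,
                "end": out_len + len(new_text),
                "entity_type": span.entity_type,
                "text": new_text,
                "operator": op,
            }
        )
        pieces.append(new_text)
        out_len += len(new_text)
        cursor = span.end
    pieces.append(text[cursor:])
    items.reverse()  # the notebook output lists items from last to first
    return "".join(pieces), items


if __name__ == "__main__":
    out_text, out_items = apply_operators(
        "My name is James Bond, I live in London",
        [Span("PERSON", 11, 21), Span("LOCATION", 33, 39)],
        {"PERSON": "keep", "DEFAULT": "replace"},
    )
    print(out_text)
    for item in out_items:
        print(item)
```

A kept entity never changes the text length, so its offsets only move when an earlier entity in the text is replaced by something longer or shorter.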
127 changes: 127 additions & 0 deletions e2e-tests/tests/test_anonymizer.py
@@ -274,3 +274,130 @@ def test_given_encrypt_called_then_decrypt_returns_the_original_encrypted_text()

    decrypted_text = json.loads(decrypted_text_response)["text"]
    assert decrypted_text == text_for_encryption


@pytest.mark.api
def test_keep_name():
    request_body = """
    {
        "text": "hello world, my name is Jane Doe. My number is: 034453334",
        "anonymizers": {
            "NAME": { "type": "keep"},
            "PHONE_NUMBER": { "type": "replace" }
        },
        "analyzer_results": [
            { "start": 24, "end": 32, "score": 0.80, "entity_type": "NAME" },
            { "start": 48, "end": 57, "score": 0.95, "entity_type": "PHONE_NUMBER" }
        ]
    }
    """

    response_status, response_content = anonymize(request_body)

    expected_response = """
    {
        "text": "hello world, my name is Jane Doe. My number is: <PHONE_NUMBER>",
        "items": [
            {"operator": "replace", "entity_type": "PHONE_NUMBER", "start": 48, "end": 62, "text":"<PHONE_NUMBER>"},
            {"operator": "keep", "entity_type": "NAME", "start": 24, "end": 32, "text":"Jane Doe"}
        ]
    }
    """

    assert response_status == 200
    assert equal_json_strings(expected_response, response_content)


@pytest.mark.api
def test_overlapping_keep_first():
    request_body = """
    {
        "text": "I'm George Washington Square Park",
        "anonymizers": {
            "NAME": { "type": "keep"},
            "LOCATION": { "type": "replace" }
        },
        "analyzer_results": [
            { "start": 4, "end": 21, "score": 0.80, "entity_type": "NAME" },
            { "start": 11, "end": 33, "score": 0.80, "entity_type": "LOCATION" }
        ]
    }
    """

    response_status, response_content = anonymize(request_body)

    expected_response = """
    {
        "text": "I'm George Washington<LOCATION>",
        "items": [
            {"operator": "replace", "entity_type": "LOCATION", "start": 21, "end": 31, "text":"<LOCATION>"},
            {"operator": "keep", "entity_type": "NAME", "start": 4, "end": 21, "text":"George Washington"}
        ]
    }
    """

    assert response_status == 200
    assert equal_json_strings(expected_response, response_content)


@pytest.mark.api
def test_overlapping_keep_second():
    request_body = """
    {
        "text": "I'm George Washington Square Park",
        "anonymizers": {
            "NAME": { "type": "replace"},
            "LOCATION": { "type": "keep" }
        },
        "analyzer_results": [
            { "start": 4, "end": 21, "score": 0.80, "entity_type": "NAME" },
            { "start": 11, "end": 33, "score": 0.80, "entity_type": "LOCATION" }
        ]
    }
    """

    response_status, response_content = anonymize(request_body)

    expected_response = """
    {
        "text": "I'm <NAME>Washington Square Park",
        "items": [
            {"operator": "keep", "entity_type": "LOCATION", "start": 10, "end": 32, "text":"Washington Square Park"},
            {"operator": "replace", "entity_type": "NAME", "start": 4, "end": 10, "text":"<NAME>"}
        ]
    }
    """

    assert response_status == 200
    assert equal_json_strings(expected_response, response_content)


@pytest.mark.api
def test_overlapping_keep_both():
    request_body = """
    {
        "text": "I'm George Washington Square Park",
        "anonymizers": {
            "DEFAULT": { "type": "keep" }
        },
        "analyzer_results": [
            { "start": 4, "end": 21, "score": 0.80, "entity_type": "NAME" },
            { "start": 11, "end": 33, "score": 0.80, "entity_type": "LOCATION" }
        ]
    }
    """

    response_status, response_content = anonymize(request_body)

    expected_response = """
    {
        "text": "I'm George WashingtonWashington Square Park",
        "items": [
            {"operator": "keep", "entity_type": "LOCATION", "start": 21, "end": 43, "text":"Washington Square Park"},
            {"operator": "keep", "entity_type": "NAME", "start": 4, "end": 21, "text":"George Washington"}
        ]
    }
    """

    assert response_status == 200
    assert equal_json_strings(expected_response, response_content)
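

A quick way to sanity-check expected responses like the ones in these tests is to confirm that every item's `start`/`end` pair slices its own `text` out of the anonymized string. The helper below is a hypothetical review aid, not part of the e2e suite; `items_match_text` and the two constants are names invented here, and the payloads are copied from `test_keep_name` and `test_overlapping_keep_both` above.

```python
import json


def items_match_text(response_json: str) -> bool:
    """Return True if each item's offsets slice its 'text' out of the output."""
    response = json.loads(response_json)
    text = response["text"]
    return all(
        text[item["start"]:item["end"]] == item["text"]
        for item in response["items"]
    )


EXPECTED_KEEP_NAME = """
{
    "text": "hello world, my name is Jane Doe. My number is: <PHONE_NUMBER>",
    "items": [
        {"operator": "replace", "entity_type": "PHONE_NUMBER", "start": 48, "end": 62, "text": "<PHONE_NUMBER>"},
        {"operator": "keep", "entity_type": "NAME", "start": 24, "end": 32, "text": "Jane Doe"}
    ]
}
"""

EXPECTED_KEEP_BOTH = """
{
    "text": "I'm George WashingtonWashington Square Park",
    "items": [
        {"operator": "keep", "entity_type": "LOCATION", "start": 21, "end": 43, "text": "Washington Square Park"},
        {"operator": "keep", "entity_type": "NAME", "start": 4, "end": 21, "text": "George Washington"}
    ]
}
"""

if __name__ == "__main__":
    print(items_match_text(EXPECTED_KEEP_NAME))
    print(items_match_text(EXPECTED_KEEP_BOTH))
```

The `keep_both` case is worth a second look when reviewing: because the two input spans overlap, the shared word "Washington" appears twice in the output, and the offsets above are consistent with that duplicated text.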
2 changes: 2 additions & 0 deletions presidio-anonymizer/presidio_anonymizer/operators/__init__.py
@@ -5,6 +5,7 @@
from .redact import Redact
from .replace import Replace
from .custom import Custom
from .keep import Keep
from .encrypt import Encrypt
from .decrypt import Decrypt
from .aes_cipher import AESCipher
@@ -16,6 +17,7 @@
    "Hash",
    "Mask",
    "Redact",
    "Keep",
    "Replace",
    "Custom",
    "Encrypt",
28 changes: 28 additions & 0 deletions presidio-anonymizer/presidio_anonymizer/operators/keep.py
@@ -0,0 +1,28 @@
"""Keeps the PII text unmodified."""
from typing import Dict

from presidio_anonymizer.operators import Operator, OperatorType


class Keep(Operator):
    """No-op anonymizer that keeps the PII text unmodified.
    This is useful when you don't want to anonymize some types of PII,
    but want to keep track of it with the other PIIs.
    """

    def operate(self, text: str = None, params: Dict = None) -> str:
        """:return: original text."""
        return text

    def validate(self, params: Dict = None) -> None:
        """Keep does not require any parameters so no validation is needed."""
        pass

    def operator_name(self) -> str:
        """Return operator name."""
        return "keep"

    def operator_type(self) -> OperatorType:
        """Return operator type."""
        return OperatorType.Anonymize
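

For readers unfamiliar with the operator pattern, the sketch below shows why a no-op class is still worth registering: a caller that looks operators up by name can treat `keep` exactly like any other operator. Everything here is a simplified stand-in; this `Operator` base class, the `REGISTRY` dict, and `run_operator` are hypothetical and do not mirror Presidio's `Operator` or `OperatorsFactory` in detail.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class Operator(ABC):
    """Stand-in base class (assumption: the real one has more methods)."""

    @abstractmethod
    def operate(self, text: str, params: Optional[Dict] = None) -> str:
        ...

    @abstractmethod
    def operator_name(self) -> str:
        ...


class Keep(Operator):
    """No-op: returns the text unmodified."""

    def operate(self, text: str, params: Optional[Dict] = None) -> str:
        return text

    def operator_name(self) -> str:
        return "keep"


class Redact(Operator):
    """Removes the text entirely."""

    def operate(self, text: str, params: Optional[Dict] = None) -> str:
        return ""

    def operator_name(self) -> str:
        return "redact"


# Hypothetical name-to-operator lookup table.
REGISTRY: Dict[str, Operator] = {
    op.operator_name(): op for op in (Keep(), Redact())
}


def run_operator(name: str, text: str, params: Optional[Dict] = None) -> str:
    """Dispatch by operator name, the way a request's "type" field would."""
    return REGISTRY[name].operate(text, params)


if __name__ == "__main__":
    print(run_operator("keep", "Jane Doe"))
    print(repr(run_operator("redact", "Jane Doe")))
```

Because `keep` goes through the same dispatch path as the other operators, the caller needs no special case for "leave this entity alone".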
21 changes: 21 additions & 0 deletions presidio-anonymizer/tests/operators/test_keep.py
@@ -0,0 +1,21 @@
import pytest

from presidio_anonymizer.operators import Keep


@pytest.mark.parametrize(
    # fmt: off
    "params",
    [
        {"new_value": ""},
        {},
    ],
    # fmt: on
)
def test_when_given_valid_value_then_same_string_returned(params):
    text = Keep().operate("bla", params)
    assert text == "bla"


def test_when_validate_anonymizer_then_correct_name():
    assert Keep().operator_name() == "keep"
12 changes: 10 additions & 2 deletions presidio-anonymizer/tests/operators/test_operators_factory.py
@@ -6,8 +6,16 @@

 def test_given_anonymizers_list_then_all_classes_are_there():
     anonymizers = OperatorsFactory.get_anonymizers()
-    assert len(anonymizers) == 6
-    for class_name in ["hash", "mask", "redact", "replace", "encrypt", "custom"]:
+    assert len(anonymizers) == 7
+    for class_name in [
+        "hash",
+        "mask",
+        "redact",
+        "replace",
+        "encrypt",
+        "custom",
+        "keep",
+    ]:
         assert anonymizers.get(class_name)


2 changes: 1 addition & 1 deletion presidio-anonymizer/tests/operators/test_redact.py
@@ -13,7 +13,7 @@
     # fmt: on
 )
 def test_given_value_for_redact_then_we_return_empty_value(params):
-    text = Redact().operate("", params)
+    text = Redact().operate("bla", params)
     assert text == ""


4 changes: 2 additions & 2 deletions presidio-anonymizer/tests/test_anonymizer_engine.py
@@ -16,8 +16,8 @@

 def test_given_request_anonymizers_return_list():
     engine = AnonymizerEngine()
-    expected_list = ["hash", "mask", "redact", "replace", "custom", "encrypt"]
-    anon_list = engine.get_anonymizers()
+    expected_list = {"hash", "mask", "redact", "replace", "custom", "keep", "encrypt"}
+    anon_list = set(engine.get_anonymizers())

     assert anon_list == expected_list
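
The switch from a list to a set in this test makes the assertion independent of the order in which operators are returned. A small illustration of the difference, using a made-up return order (the real order returned by `get_anonymizers()` is not shown in this commit, so `returned` below is purely hypothetical):

```python
from typing import Iterable

expected = ["hash", "mask", "redact", "replace", "custom", "keep", "encrypt"]
# Hypothetical order: same names, with "keep" in a different position.
returned = ["hash", "mask", "redact", "keep", "replace", "custom", "encrypt"]


def same_members(a: Iterable[str], b: Iterable[str]) -> bool:
    """Order-insensitive comparison; duplicates are ignored."""
    return set(a) == set(b)


def same_members_strict(a: Iterable[str], b: Iterable[str]) -> bool:
    """Order-insensitive comparison that still notices duplicates."""
    return sorted(a) == sorted(b)


if __name__ == "__main__":
    print(returned == expected)                   # list equality depends on order
    print(same_members(returned, expected))
    print(same_members_strict(returned, expected))
```

One trade-off of comparing sets is that an accidentally duplicated name would go unnoticed; comparing sorted lists is a stricter alternative if that matters.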
