# Reversible data anonymization with Microsoft Presidio

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/guides/privacy/reversible_anonymization.ipynb)


## Use case

The previous section covered why anonymizing sensitive data matters. **Reversible anonymization** is equally important when sharing information with language models, because it balances data protection with data usability. The technique masks sensitive personally identifiable information (PII) in a way that can be reversed, so the original data can be restored when authorized users need it. Its main advantage is that it conceals individual identities to prevent misuse while still allowing the concealed data to be accurately unmasked when that is necessary for legal or compliance purposes.

## Overview

We implemented the `PresidioReversibleAnonymizer`, which consists of two parts:

1. anonymization - it works the same way as `PresidioAnonymizer`, plus the object itself stores a mapping of made-up values to original ones, for example:
```
    {
        "PERSON": {
            "<anonymized>": "<original>",
            "John Doe": "Slim Shady"
        },
        "PHONE_NUMBER": {
            "111-111-1111": "555-555-5555"
        }
        ...
    }
```

2. deanonymization - using the mapping described above, it matches fake data with original data and then substitutes it.

Between anonymization and deanonymization, the user can perform other operations, for example passing the anonymized output to an LLM.
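The two parts described above can be pictured with a toy class. This is a minimal sketch for illustration, not the `PresidioReversibleAnonymizer` implementation: the class name `ToyReversibleAnonymizer`, the single phone-number regex, and the counter-based fake values are all assumptions standing in for Presidio's detection and Faker's generated data.

```python
import re
from itertools import count


class ToyReversibleAnonymizer:
    """Illustrative stand-in only; not how the library is implemented."""

    # Hypothetical detector: one regex for US-style phone numbers.
    PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

    def __init__(self):
        # Same shape as the mapping above: {entity_type: {fake: original}}
        self.deanonymizer_mapping = {}
        self._counter = count(1)

    def _fake_phone(self):
        # Deterministic placeholder values instead of Faker output.
        return f"555-000-{next(self._counter):04d}"

    def anonymize(self, text):
        def substitute(match):
            fake = self._fake_phone()
            # Remember which original value the fake one replaced.
            self.deanonymizer_mapping.setdefault("PHONE_NUMBER", {})[fake] = match.group(0)
            return fake

        return self.PHONE_RE.sub(substitute, text)

    def deanonymize(self, text):
        # Exact string replacement of every known fake value.
        for pairs in self.deanonymizer_mapping.values():
            for fake, original in pairs.items():
                text = text.replace(fake, original)
        return text


anonymizer = ToyReversibleAnonymizer()
masked = anonymizer.anonymize("Call me at 313-666-7440.")
# ... the masked text could be passed to an LLM here ...
restored = anonymizer.deanonymize(masked)
print(masked)
print(restored)
```

The point of the sketch is that the mapping lives on the object, so anything that happens to the text between the two calls never needs access to the original values.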

## Quickstart



In [1]:
# Install necessary packages
# ! pip install langchain langchain-experimental openai
# ! python -m spacy download en_core_web_lg

This is a standard `PresidioAnonymizer` that can detect and substitute sensitive data:

In [1]:
from langchain_experimental.data_anonymizer import PresidioAnonymizer

anonymizer = PresidioAnonymizer(
    analyzed_fields=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS", "CREDIT_CARD"],
    # A Faker seed is used here so that the same fake data is generated on every run (for testing)
    # In production, it is recommended to remove the faker_seed parameter (it will default to None)
    faker_seed=0,
)

anonymizer.anonymize(
    "My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com"
)

'My name is Sydney Davis, call me at 515-978-1565 or email me at tammy76@example.com'

`PresidioReversibleAnonymizer` is not significantly different from its predecessor in terms of anonymization:

In [2]:
from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer

anonymizer = PresidioReversibleAnonymizer(
    analyzed_fields=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS", "CREDIT_CARD"],
    # A Faker seed is used here so that the same fake data is generated on every run (for testing)
    # In production, it is recommended to remove the faker_seed parameter (it will default to None)
    faker_seed=0,
)

anonymizer.anonymize(
    "My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com"
)

'My name is Sydney Davis, call me at 515-978-1565 or email me at tammy76@example.com'

What is new is that it stores the mapping from fake values to original values in the `deanonymizer_mapping` attribute, where each key is a fake PII value and each value is the original one:

In [3]:
anonymizer.deanonymizer_mapping

{'PERSON': {'Sydney Davis': 'Slim Shady'},
 'PHONE_NUMBER': {'515-978-1565': '313-666-7440'},
 'EMAIL_ADDRESS': {'tammy76@example.com': 'real.slim.shady@gmail.com'}}

Anonymizing more texts will result in new mapping entries:

In [4]:
print(
    anonymizer.anonymize(
        "Do you have his VISA card number? Yep, it's 4001 9192 5753 7193. I'm John Doe by the way."
    )
)

anonymizer.deanonymizer_mapping

Do you have his VISA card number? Yep, it's 180014841858395. I'm Lisa Clayton by the way.


{'PERSON': {'Sydney Davis': 'Slim Shady', 'Lisa Clayton': 'John Doe'},
 'PHONE_NUMBER': {'515-978-1565': '313-666-7440'},
 'EMAIL_ADDRESS': {'tammy76@example.com': 'real.slim.shady@gmail.com'},
 'CREDIT_CARD': {'180014841858395': '4001 9192 5753 7193'}}
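The way the mapping grows between the two calls above can be read as a nested dictionary merge. The sketch below only illustrates that idea: `merge_mapping` is a hypothetical helper, not a library function, and the data is copied from the outputs shown above.

```python
def merge_mapping(existing, new_entries):
    """Merge {entity_type: {fake: original}} entries into `existing` in place."""
    for entity_type, pairs in new_entries.items():
        # Create the entity bucket on first use, then add the new pairs.
        existing.setdefault(entity_type, {}).update(pairs)
    return existing


# State after the first anonymize() call shown earlier.
mapping = {
    "PERSON": {"Sydney Davis": "Slim Shady"},
    "PHONE_NUMBER": {"515-978-1565": "313-666-7440"},
    "EMAIL_ADDRESS": {"tammy76@example.com": "real.slim.shady@gmail.com"},
}

# Entries produced by the second call.
merge_mapping(
    mapping,
    {
        "PERSON": {"Lisa Clayton": "John Doe"},
        "CREDIT_CARD": {"180014841858395": "4001 9192 5753 7193"},
    },
)
print(mapping)
```

Existing entity types gain new pairs and unseen entity types get their own bucket, which matches the shape of the output above.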

Now let's check the deanonymization process. Because Faker is seeded (for testing purposes only), we know in advance which fake value is generated for each field, and we can use those values to test the reversal of anonymization:

In [5]:
fake_name = "Sydney Davis"
fake_phone = "515-978-1565"
fake_email = "tammy76@example.com"
fake_credit_card = "180014841858395"

This is what the full string we want to deanonymize looks like:

In [6]:
anonymized_text = f"""{fake_name} recently lost his wallet. 
Inside is some cash and his credit card with the number {fake_credit_card}. 
If you would find it, please call him at {fake_phone} or email him: {fake_email}."""

print(anonymized_text)

Sydney Davis recently lost his wallet. 
Inside is some cash and his credit card with the number 180014841858395. 
If you would find it, please call him at 515-978-1565 or email him: tammy76@example.com.


And now, using the `deanonymize` method, we can reverse the process:

In [7]:
print(anonymizer.deanonymize(anonymized_text))

Slim Shady recently lost his wallet. 
Inside is some cash and his credit card with the number 4001 9192 5753 7193. 
If you would find it, please call him at 313-666-7440 or email him: real.slim.shady@gmail.com.


As you can see, the restored values agree with the mapping. The values are also restored when a fake value appears more than once in the text, for example:

In [8]:
anonymized_text = f"{anonymized_text}\n{fake_name} will be very grateful!"
print(anonymizer.deanonymize(anonymized_text))

Slim Shady recently lost his wallet. 
Inside is some cash and his credit card with the number 4001 9192 5753 7193. 
If you would find it, please call him at 313-666-7440 or email him: real.slim.shady@gmail.com.
Slim Shady will be very grateful!


We can save the mapping itself to a file for future use: 

In [9]:
# We can save the deanonymizer mapping as a JSON or YAML file

anonymizer.save_deanonymizer_mapping("deanonymizer_mapping.json")
# anonymizer.save_deanonymizer_mapping("deanonymizer_mapping.yaml")

And then, load it in another `PresidioReversibleAnonymizer` instance:

In [10]:
anonymizer = PresidioReversibleAnonymizer()

anonymizer.deanonymizer_mapping

{}

In [11]:
anonymizer.load_deanonymizer_mapping("deanonymizer_mapping.json")

anonymizer.deanonymizer_mapping

{'PERSON': {'Sydney Davis': 'Slim Shady', 'Lisa Clayton': 'John Doe'},
 'PHONE_NUMBER': {'515-978-1565': '313-666-7440'},
 'EMAIL_ADDRESS': {'tammy76@example.com': 'real.slim.shady@gmail.com'},
 'CREDIT_CARD': {'180014841858395': '4001 9192 5753 7193'}}
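To make the round trip concrete without the library, here is a sketch that writes the same nested dictionary to JSON and reads it back using only the standard library. It assumes the saved file holds nothing more than the nested dictionary; the layout that `save_deanonymizer_mapping` actually writes may differ.

```python
import json
import tempfile
from pathlib import Path

mapping = {
    "PERSON": {"Sydney Davis": "Slim Shady", "Lisa Clayton": "John Doe"},
    "PHONE_NUMBER": {"515-978-1565": "313-666-7440"},
    "EMAIL_ADDRESS": {"tammy76@example.com": "real.slim.shady@gmail.com"},
    "CREDIT_CARD": {"180014841858395": "4001 9192 5753 7193"},
}

# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "deanonymizer_mapping.json"
    path.write_text(json.dumps(mapping, indent=2), encoding="utf-8")
    restored = json.loads(path.read_text(encoding="utf-8"))

print(restored == mapping)
```

Note that such a file contains the original PII in plain text, so it deserves the same protection as the data it was meant to hide.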

Now, we will use the example from the previous section, in which we implemented anonymization in a sequence of chains. This is what it looked like before:

In [30]:
from langchain.prompts.prompt import PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.schema.runnable import RunnablePassthrough

anonymizer = PresidioReversibleAnonymizer()

template = """According to this text, where can you find our super secret data?

{anonymized_text}

Answer:"""
prompt = PromptTemplate.from_template(template)
llm = ChatOpenAI()

chain = {"anonymized_text": anonymizer.anonymize} | prompt | llm
chain.invoke("You can find our super secret data at https://supersecretdata.com")

AIMessage(content='You can find our super secret data at https://www.avery-russell.net/', additional_kwargs={}, example=False)

Now, let's add a **deanonymization step** to our sequence:

In [31]:
chain = chain | (lambda ai_message: anonymizer.deanonymize(ai_message.content))
chain.invoke("You can find our super secret data at https://supersecretdata.com")

'According to the given text, you can find the super secret data at https://supersecretdata.com.'

Only anonymized data was given to the model, so the original value was protected from being leaked to the outside world. The model's response was then post-processed, and the fake value in it was replaced with the real one.
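The data flow of that chain can be traced with plain functions and a stand-in model. Everything in this sketch is an assumption made for illustration: the fixed `MAPPING` (the real class fills it in during `anonymize`), the `stub_llm` function that simply echoes the text it was given, and the `seen_by_model` list used to inspect what the model received.

```python
# Hypothetical fixed mapping, using the fake URL from the output above.
MAPPING = {"URL": {"https://www.avery-russell.net/": "https://supersecretdata.com"}}

seen_by_model = []  # records every prompt the stand-in model receives


def anonymize(text):
    # Replace each original value with its fake counterpart.
    for pairs in MAPPING.values():
        for fake, original in pairs.items():
            text = text.replace(original, fake)
    return text


def deanonymize(text):
    # Replace each fake value with its original counterpart.
    for pairs in MAPPING.values():
        for fake, original in pairs.items():
            text = text.replace(fake, original)
    return text


def build_prompt(anonymized_text):
    return (
        "According to this text, where can you find our super secret data?\n\n"
        f"{anonymized_text}\n\nAnswer:"
    )


def stub_llm(prompt):
    # Stand-in for a chat model: logs the prompt and echoes the text paragraph.
    seen_by_model.append(prompt)
    return prompt.split("\n\n")[1]


def pipeline(text):
    # anonymize -> prompt -> model -> deanonymize, mirroring the chain above
    return deanonymize(stub_llm(build_prompt(anonymize(text))))


result = pipeline("You can find our super secret data at https://supersecretdata.com")
print(result)
```

Inspecting `seen_by_model` shows that the prompt contains only the fake URL, while `result` contains the real one again.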

## Future work

- **instance anonymization** - at this point, each occurrence of PII is treated as a separate entity and anonymized separately, so two occurrences of the name John Doe in a text will be changed to two different names. It is therefore worth introducing support for full instance detection, so that repeated occurrences are treated as a single object.
- **better matching and substitution of fake values for real ones** - currently the strategy is based on matching full strings and then substituting them. Because language models are non-deterministic, the value in the answer may be slightly changed (e.g. *John Doe* -> *John* or *Main St, New York* -> *New York*), and the substitution is then no longer possible. Therefore, it is worth adjusting the matching to your needs.
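As one possible direction for the second point, the sketch below relaxes exact matching in two ways: it matches full fake values case-insensitively, and it falls back to matching single words of multi-word `PERSON` values. The function `deanonymize_tolerant` and both heuristics are assumptions for illustration, not library behaviour, and they can produce false positives (for example when a fake first name also occurs as an ordinary word).

```python
import re


def deanonymize_tolerant(text, mapping):
    """Restore originals even if the model shortened or re-cased a fake value."""
    pairs = [(fake, orig) for group in mapping.values() for fake, orig in group.items()]

    # Pass 1: full fake values, ignoring case.
    for fake, original in pairs:
        text = re.sub(re.escape(fake), lambda m, o=original: o, text, flags=re.IGNORECASE)

    # Pass 2: lone words of multi-word PERSON values (e.g. first name only).
    for fake, original in mapping.get("PERSON", {}).items():
        fake_parts, orig_parts = fake.split(), original.split()
        if len(fake_parts) < 2 or len(fake_parts) != len(orig_parts):
            continue  # no obvious word-to-word correspondence
        for fake_part, orig_part in zip(fake_parts, orig_parts):
            text = re.sub(rf"\b{re.escape(fake_part)}\b", lambda m, o=orig_part: o, text)

    return text


mapping = {"PERSON": {"Lisa Clayton": "John Doe"}}
reply = "Lisa Clayton lost a wallet. Please tell Lisa we found it."
print(deanonymize_tolerant(reply, mapping))
```

Exact full-string substitution would leave the second, shortened "Lisa" untouched; the word-level fallback maps it to the corresponding word of the original name.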