# QA with private data protection

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/guides/privacy/presidio_data_anonymization/reversible.ipynb)


[TODO: description]


## Quickstart

### Iterative process of upgrading the anonymizer

In [1]:
# Install necessary packages
# !pip install langchain langchain-experimental openai presidio-analyzer presidio-anonymizer spacy Faker faiss-cpu tiktoken
# ! python -m spacy download en_core_web_lg

In [3]:
from langchain.document_loaders import TextLoader

loader = TextLoader("text_with_private_data.txt")

documents = loader.load_and_split()
len(documents)

1

In [4]:
document_content = documents[0].page_content

In [5]:
print(document_content)

Date: October 19, 2021
Witness: Maks Operlejn
Subject: Testimony Regarding the Loss of Wallet

Testimony Content:

Hello Officer,

My name is Maks Operlejn and on October 19, 2021, my wallet was stolen in the vicinity of Kilmarnock during a bike trip. This wallet contains some very important things to me.

Firstly, the wallet contains my credit card with number 5412 5412 5412 5412, which is registered under my name and linked to my bank account, PL61109010140000071219812874.

Additionally, the wallet had a driver's license - DL No: 999000680 issued to my name. It also houses my Social Security Number, 602-76-4532. 

What's more, I had my polish identity card there, with the number ABC123456.

I would like this data to be secured and protected in all possible ways. I believe It was stolen at 9:30 AM.

In case any information arises regarding my wallet, please reach out to me on my phone number, 999-888-7777, or through my personal email, maksoperlejn@example.com.

Please consider this i

In [None]:
# Utility function that highlights the PII markers (e.g. <PERSON>) in red
import re


def print_colored_pii(string):
    colored_string = re.sub(
        r"(<[^>]*>)", lambda m: "\033[31m" + m.group(1) + "\033[0m", string
    )
    print(colored_string)

In [6]:
from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer

anonymizer = PresidioReversibleAnonymizer(
    add_default_faker_operators=False,
)

print_colored_pii(anonymizer.anonymize(document_content))

Date: <DATE_TIME>
Witness: <PERSON>
Subject: Testimony Regarding the Loss of Wallet

Testimony Content:

Hello Officer,

My name is <PERSON> and on <DATE_TIME>, my wallet was stolen in the vicinity of <LOCATION> during a bike trip. This wallet contains some very important things to me.

Firstly, the wallet contains my credit card with number 5412 5412 5412 5412, which is registered under my name and linked to my bank account, <IBAN_CODE>.

Additionally, the wallet had a driver's license - DL No: <US_DRIVER_LICENSE> issued to my name. It also houses my Social Security Number, <US_SSN>. 

What's more, I had my polish identity card there, with the number ABC123456.

I would like this data to be secured and protected in all possible ways. I believe It was stolen at <DATE_TIME_2>.

In case any information arises regarding my wallet, please reach out to me on my phone number, <PHONE_NUMBER>, or through 

In [7]:
import pprint

pprint.pprint(anonymizer.deanonymizer_mapping)

{'DATE_TIME': {'<DATE_TIME>': 'October 19, 2021', '<DATE_TIME_2>': '9:30 AM'},
 'EMAIL_ADDRESS': {'<EMAIL_ADDRESS>': 'maksoperlejn@example.com',
                   '<EMAIL_ADDRESS_2>': 'support@bankname.com'},
 'IBAN_CODE': {'<IBAN_CODE>': 'PL61109010140000071219812874'},
 'LOCATION': {'<LOCATION>': 'Kilmarnock'},
 'PERSON': {'<PERSON>': 'Maks Operlejn', '<PERSON_2>': 'Victoria Cherry'},
 'PHONE_NUMBER': {'<PHONE_NUMBER>': '999-888-7777',
                  '<PHONE_NUMBER_2>': '987-654-3210'},
 'US_DRIVER_LICENSE': {'<US_DRIVER_LICENSE>': '999000680'},
 'US_SSN': {'<US_SSN>': '602-76-4532'}}
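The mapping printed above is what makes the anonymization reversible: each placeholder points back to the original value. As a rough illustration only, the sketch below restores a sentence from a mapping of that shape using plain string replacement. The helper name `restore_from_mapping` is hypothetical and is not how `PresidioReversibleAnonymizer.deanonymize` is implemented.

```python
# Hypothetical helper for illustration; not the library's implementation.
def restore_from_mapping(text, mapping):
    # Flatten {entity_type: {placeholder: original}} into one lookup table.
    flat = {}
    for replacements in mapping.values():
        flat.update(replacements)
    # Defensive ordering: handle longer placeholders first.
    for placeholder in sorted(flat, key=len, reverse=True):
        text = text.replace(placeholder, flat[placeholder])
    return text


# A subset of the mapping shown in the notebook output.
mapping = {
    "PERSON": {"<PERSON>": "Maks Operlejn", "<PERSON_2>": "Victoria Cherry"},
    "LOCATION": {"<LOCATION>": "Kilmarnock"},
    "DATE_TIME": {"<DATE_TIME>": "October 19, 2021", "<DATE_TIME_2>": "9:30 AM"},
}

anonymized = (
    "My name is <PERSON> and on <DATE_TIME>, "
    "my wallet was stolen in the vicinity of <LOCATION>."
)
restored = restore_from_mapping(anonymized, mapping)
print(restored)
```

Keeping one placeholder per distinct value (`<PERSON>`, `<PERSON_2>`, ...) is what lets repeated mentions of the same person map back consistently.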


In [8]:
# Define the regex pattern in a Presidio `Pattern` object:
from presidio_analyzer import Pattern, PatternRecognizer


polish_id_pattern = Pattern(
    name="polish_id_pattern",
    # Raw string so that `\d` is not treated as a Python escape sequence
    regex=r"[A-Z]{3}\d{6}",
    score=1,
)
time_pattern = Pattern(
    name="time_pattern",
    regex=r"(1[0-2]|0?[1-9]):[0-5][0-9] (AM|PM)",
    score=1,
)

# Define the recognizer with one or more patterns
polish_id_recognizer = PatternRecognizer(
    supported_entity="POLISH_ID", patterns=[polish_id_pattern]
)
time_recognizer = PatternRecognizer(supported_entity="TIME", patterns=[time_pattern])
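Before registering the recognizers, it can help to sanity-check the two regular expressions against the values they are meant to catch. The sketch below does this with the standard `re` module only. Presidio may apply its own regex flags when running a `PatternRecognizer`, so treat this as an approximation of the matching behaviour rather than an exact reproduction.

```python
import re

# The same expressions that are passed to the Presidio Pattern objects.
POLISH_ID_REGEX = r"[A-Z]{3}\d{6}"
TIME_REGEX = r"(1[0-2]|0?[1-9]):[0-5][0-9] (AM|PM)"

# Fragment of the sample testimony that the default recognizers missed.
sample = "with the number ABC123456. I believe It was stolen at 9:30 AM."

polish_id_match = re.search(POLISH_ID_REGEX, sample)
time_match = re.search(TIME_REGEX, sample)

print(polish_id_match.group(0))
print(time_match.group(0))

# The ID pattern has no word boundaries, so it also matches
# inside a longer alphanumeric token.
embedded_match = re.search(POLISH_ID_REGEX, "XABC1234567")
print(embedded_match is not None)
```

If matches inside longer tokens are not wanted, wrapping the expression in `\b...\b` is one option to consider.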

In [9]:
anonymizer.add_recognizer(polish_id_recognizer)
anonymizer.add_recognizer(time_recognizer)

In [10]:
anonymizer.reset_deanonymizer_mapping()

In [11]:
print_colored_pii(anonymizer.anonymize(document_content))

Date: <DATE_TIME>
Witness: <PERSON>
Subject: Testimony Regarding the Loss of Wallet

Testimony Content:

Hello Officer,

My name is <PERSON> and on <DATE_TIME>, my wallet was stolen in the vicinity of <LOCATION> during a bike trip. This wallet contains some very important things to me.

Firstly, the wallet contains my credit card with number 5412 5412 5412 5412, which is registered under my name and linked to my bank account, <IBAN_CODE>.

Additionally, the wallet had a driver's license - DL No: <US_DRIVER_LICENSE> issued to my name. It also houses my Social Security Number, <US_SSN>. 

What's more, I had my polish identity card there, with the number <POLISH_ID>.

I would like this data to be secured and protected in all possible ways. I believe It was stolen at <TIME>.

In case any information arises regarding my wallet, please reach out to me on my phone number, <PHONE_NUMBER>, or thro

In [12]:
pprint.pprint(anonymizer.deanonymizer_mapping)

{'DATE_TIME': {'<DATE_TIME>': 'October 19, 2021'},
 'EMAIL_ADDRESS': {'<EMAIL_ADDRESS>': 'maksoperlejn@example.com',
                   '<EMAIL_ADDRESS_2>': 'support@bankname.com'},
 'IBAN_CODE': {'<IBAN_CODE>': 'PL61109010140000071219812874'},
 'LOCATION': {'<LOCATION>': 'Kilmarnock'},
 'PERSON': {'<PERSON>': 'Maks Operlejn', '<PERSON_2>': 'Victoria Cherry'},
 'PHONE_NUMBER': {'<PHONE_NUMBER>': '999-888-7777',
                  '<PHONE_NUMBER_2>': '987-654-3210'},
 'POLISH_ID': {'<POLISH_ID>': 'ABC123456'},
 'TIME': {'<TIME>': '9:30 AM'},
 'US_DRIVER_LICENSE': {'<US_DRIVER_LICENSE>': '999000680'},
 'US_SSN': {'<US_SSN>': '602-76-4532'}}


In [13]:
anonymizer = PresidioReversibleAnonymizer(
    add_default_faker_operators=True,
    # A Faker seed is used here so that the same fake data is generated on every run (useful for testing).
    # In production, remove the faker_seed parameter (it defaults to None).
    faker_seed=42,
)

anonymizer.add_recognizer(polish_id_recognizer)
anonymizer.add_recognizer(time_recognizer)

print_colored_pii(anonymizer.anonymize(document_content))

Date: 1995-06-12
Witness: Mark Lynch
Subject: Testimony Regarding the Loss of Wallet

Testimony Content:

Hello Officer,

My name is Mark Lynch and on 1995-06-12, my wallet was stolen in the vicinity of Kaylamouth during a bike trip. This wallet contains some very important things to me.

Firstly, the wallet contains my credit card with number 5412 5412 5412 5412, which is registered under my name and linked to my bank account, GB71HEQP10122691669784.

Additionally, the wallet had a driver's license - DL No: 870854579 issued to my name. It also houses my Social Security Number, 275-09-3457. 

What's more, I had my polish identity card there, with the number <POLISH_ID>.

I would like this data to be secured and protected in all possible ways. I believe It was stolen at <TIME>.

In case any information arises regarding my wallet, please reach out to me on my phone number, +1-528-232-7648x350, or through my personal email, kendragalloway@example.org.

Please consider th

In [14]:
from faker import Faker

fake = Faker()


def fake_polish_id(_=None):
    return fake.bothify(text="???######").upper()


fake_polish_id()

'VNY913049'

In [15]:
def fake_time(_=None):
    return fake.time(pattern="%I:%M %p")


fake_time()

'03:03 PM'
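Fake values are most useful when they keep the same shape as the data they replace, so a quick check against the recognizer patterns is worthwhile. The sketch below is a standard-library stand-in for the two Faker-based helpers above. The names `sketch_polish_id` and `sketch_time` are hypothetical, and the seed is there only to make the sketch repeatable.

```python
import random
import re
import string

POLISH_ID_REGEX = r"[A-Z]{3}\d{6}"
TIME_REGEX = r"(1[0-2]|0?[1-9]):[0-5][0-9] (AM|PM)"

rng = random.Random(42)  # seeded only so that the sketch is repeatable


def sketch_polish_id():
    # Three uppercase letters followed by six digits, like "ABC123456".
    letters = "".join(rng.choice(string.ascii_uppercase) for _ in range(3))
    digits = "".join(rng.choice(string.digits) for _ in range(6))
    return letters + digits


def sketch_time():
    # 12-hour clock with a zero-padded hour, like "03:03 PM".
    hour = rng.randint(1, 12)
    minute = rng.randint(0, 59)
    period = rng.choice(["AM", "PM"])
    return f"{hour:02d}:{minute:02d} {period}"


# Every generated value should match the corresponding pattern in full.
ids_ok = all(re.fullmatch(POLISH_ID_REGEX, sketch_polish_id()) for _ in range(100))
times_ok = all(re.fullmatch(TIME_REGEX, sketch_time()) for _ in range(100))
print(ids_ok, times_ok)
```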

In [16]:
from presidio_anonymizer.entities import OperatorConfig

new_operators = {
    "POLISH_ID": OperatorConfig("custom", {"lambda": fake_polish_id}),
    "TIME": OperatorConfig("custom", {"lambda": fake_time}),
}

anonymizer.add_operators(new_operators)

In [17]:
anonymizer.reset_deanonymizer_mapping()
print_colored_pii(anonymizer.anonymize(document_content))

Date: 2008-06-18
Witness: Lisa Barnes
Subject: Testimony Regarding the Loss of Wallet

Testimony Content:

Hello Officer,

My name is Lisa Barnes and on 2008-06-18, my wallet was stolen in the vicinity of Shawhaven during a bike trip. This wallet contains some very important things to me.

Firstly, the wallet contains my credit card with number 5412 5412 5412 5412, which is registered under my name and linked to my bank account, GB37GMBF60647468723430.

Additionally, the wallet had a driver's license - DL No: 250519597 issued to my name. It also houses my Social Security Number, 197-25-8787. 

What's more, I had my polish identity card there, with the number JDO417982.

I would like this data to be secured and protected in all possible ways. I believe It was stolen at 06:10 PM.

In case any information arises regarding my wallet, please reach out to me on my phone number, 863.711.6566x7010, or through my personal email, perezrebecca@example.com.

Please consider this information to be 

### Question-answering system with PII anonymization

In [18]:
anonymizer = PresidioReversibleAnonymizer(
    # A Faker seed is used here so that the same fake data is generated on every run (useful for testing).
    # In production, remove the faker_seed parameter (it defaults to None).
    faker_seed=42,
)

anonymizer.add_recognizer(polish_id_recognizer)
anonymizer.add_recognizer(time_recognizer)

anonymizer.add_operators(new_operators)

In [19]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

loader = TextLoader("text_with_pii.txt")
documents = loader.load()

for doc in documents:
    doc.page_content = anonymizer.anonymize(doc.page_content)

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
docsearch = FAISS.from_documents(chunks, embeddings)
retriever = docsearch.as_retriever()

In [20]:
from operator import itemgetter
from langchain.chat_models.openai import ChatOpenAI
from langchain.schema.runnable import RunnableMap
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough


template = """Answer the question based only on the following context:
{context}

Question: {anonymized_question}
"""
prompt = ChatPromptTemplate.from_template(template)

model = ChatOpenAI(temperature=0.3)

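chain = (
    # Wrap the raw input string as {"question": ...}
    RunnableMap({"question": RunnablePassthrough()})
    | {
        # Retrieve context chunks from the anonymized index
        "context": itemgetter("question") | retriever,
        # Anonymize the question before it reaches the model
        "anonymized_question": lambda x: anonymizer.anonymize(x["question"]),
    }
    | prompt
    | model
    | StrOutputParser()
)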

In [21]:
chain.invoke(
    "Where did the theft of the wallet occur, at what time, and who was it stolen from?"
)

'The theft of the wallet occurred in the vicinity of Kaylamouth during a bike trip. It was stolen from Mark Lynch at 04:19 AM.'

In [22]:
from langchain.schema.runnable import RunnableLambda

chain_with_deanonymization = chain | RunnableLambda(anonymizer.deanonymize)

chain_with_deanonymization.invoke(
    "Where did the theft of the wallet occur, at what time, and who was it stolen from?"
)

'The theft of the wallet occurred in the vicinity of Kilmarnock during a bike trip. It was stolen from Maks Operlejn. The time of the theft was 9:30 AM.'
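The two answers above differ only in the final step: the model always works on fake values, and the real ones are restored after the response comes back. The sketch below mimics that round trip without any external services. `ToyReversibleAnonymizer` and `stub_model` are hypothetical stand-ins for the real anonymizer and chat model, and the fixed replacements are chosen by hand.

```python
class ToyReversibleAnonymizer:
    """Toy stand-in: swaps known originals for fixed fake values and back."""

    def __init__(self, fake_by_original):
        self.fake_by_original = dict(fake_by_original)

    def anonymize(self, text):
        for original, fake in self.fake_by_original.items():
            text = text.replace(original, fake)
        return text

    def deanonymize(self, text):
        for original, fake in self.fake_by_original.items():
            text = text.replace(fake, original)
        return text


def stub_model(prompt):
    # Stand-in for the chat model: it must only ever see anonymized text.
    assert "Maks Operlejn" not in prompt
    return "It was stolen from Mark Lynch in the vicinity of Kaylamouth."


toy = ToyReversibleAnonymizer(
    {"Maks Operlejn": "Mark Lynch", "Kilmarnock": "Kaylamouth"}
)

context = toy.anonymize(
    "My name is Maks Operlejn and my wallet was stolen in the vicinity of Kilmarnock."
)
question = toy.anonymize("Where was the wallet of Maks Operlejn stolen?")
prompt = f"Answer based only on this context:\n{context}\n\nQuestion: {question}"

anonymized_answer = stub_model(prompt)
final_answer = toy.deanonymize(anonymized_answer)
print(final_answer)
```

Because the same mapping is used for the context, the question, and the answer, the model's reply can be translated back without the model ever seeing the real values.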

In [23]:
chain_with_deanonymization.invoke("What was the content of the wallet?")

"The content of the wallet included a credit card with the number 5412 5412 5412 5412, a driver's license with the number 999000680, a Social Security Number of 602-76-4532, and a Polish identity card with the number ABC123456."

In [24]:
chain_with_deanonymization.invoke("How can the victim be contacted?")

'The victim can be contacted through their phone number, 999-888-7777, or through their personal email, maksoperlejn@example.com.'

### Alternative approach: local embeddings + anonymizing the context after indexing

In [25]:
from langchain.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-base-en-v1.5"
# model_kwargs = {'device': 'cuda'}
encode_kwargs = {"normalize_embeddings": True}  # set True to compute cosine similarity
local_embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    # model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
    query_instruction="Represent this sentence for searching relevant passages:",
)

In [26]:
loader = TextLoader("text_with_pii.txt")
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(documents)

docsearch = FAISS.from_documents(chunks, local_embeddings)
retriever = docsearch.as_retriever()

In [27]:
template = """Answer the question based only on the following context:
{context}

Question: {anonymized_question}
"""
prompt = ChatPromptTemplate.from_template(template)

model = ChatOpenAI(temperature=0.2)

In [28]:
from langchain.prompts.prompt import PromptTemplate
from langchain.schema import format_document

DEFAULT_DOCUMENT_PROMPT = PromptTemplate.from_template(template="{page_content}")


def _combine_documents(
    docs, document_prompt=DEFAULT_DOCUMENT_PROMPT, document_separator="\n\n"
):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)


chain_with_deanonymization = (
    RunnableMap({"question": RunnablePassthrough()})
    | {
        "context": itemgetter("question")
        | retriever
        | _combine_documents
        | anonymizer.anonymize,
        "question": itemgetter("question"),
        "anonymized_question": lambda x: anonymizer.anonymize(x["question"]),
    }
    | prompt
    | model
    | StrOutputParser()
    | RunnableLambda(anonymizer.deanonymize)
)
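In this variant the index holds the original text, so retrieval can use the raw question, and anonymization happens only on the retrieved context just before it is sent to the model. The sketch below imitates that ordering with a toy word-overlap retriever. Every name in it (`toy_retrieve`, `toy_anonymize`, `combine`) is hypothetical and none of it reflects how FAISS or the embeddings actually rank documents.

```python
import re

# Raw, non-anonymized chunks, as they would sit in a local index.
RAW_CHUNKS = [
    "My name is Maks Operlejn and my wallet was stolen in the vicinity of Kilmarnock.",
    "Please reach out to me on my phone number, 999-888-7777.",
]

# Hand-picked replacements standing in for the anonymizer's fake values.
FAKE_BY_ORIGINAL = {
    "Maks Operlejn": "Mark Lynch",
    "Kilmarnock": "Kaylamouth",
    "999-888-7777": "555-000-1111",
}


def tokenize(text):
    return set(re.findall(r"\w+", text.lower()))


def toy_retrieve(question, chunks, k=1):
    # Rank chunks by how many words they share with the raw question.
    q_words = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q_words & tokenize(c)), reverse=True)
    return ranked[:k]


def combine(chunks, separator="\n\n"):
    return separator.join(chunks)


def toy_anonymize(text):
    for original, fake in FAKE_BY_ORIGINAL.items():
        text = text.replace(original, fake)
    return text


question = "Where was the wallet of Maks Operlejn stolen?"
retrieved = toy_retrieve(question, RAW_CHUNKS, k=1)  # search on raw text
context = toy_anonymize(combine(retrieved))  # anonymize after retrieval
anonymized_question = toy_anonymize(question)
print(context)
print(anonymized_question)
```

The trade-off is that the raw text must stay local: only the anonymized context and question should leave the machine.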

In [29]:
chain_with_deanonymization.invoke(
    "Where did the theft of the wallet occur, at what time, and who was it stolen from?"
)

'The theft of the wallet occurred in the vicinity of Kilmarnock during a bike trip. It was stolen from Maks Operlejn. The time of the theft was 9:30 AM.'

In [30]:
chain_with_deanonymization.invoke("What was the content of the wallet?")

"The content of the wallet included a credit card, a driver's license, a Social Security Number, and a Polish identity card."

In [31]:
chain_with_deanonymization.invoke("How can the victim be contacted?")

'The victim can be contacted through their personal email, maksoperlejn@example.com, or their phone number, 999-888-7777.'