# Data anonymization

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/use_cases/data_anonymization.ipynb)

## Use case

Data anonymization is crucial before passing information to a language model like GPT-4 because it helps protect privacy and maintain confidentiality. If data is not anonymized, sensitive information such as names, addresses, phone numbers, or other identifiers linked to specific individuals could be learned by the model and misused. By obscuring or removing this personally identifiable information (PII), you can use the data with a much lower risk of compromising individuals' privacy rights or breaching data protection laws and regulations.

## Overview

Anonymization consists of two steps:

1. **Identification:** Identify all data fields that contain personally identifiable information (PII).
2. **Replacement:** Replace each piece of PII with a pseudo value or code that reveals nothing about the individual but can still be used for reference. We do not use regular encryption, because the language model would not be able to understand the meaning or context of encrypted data.
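
The two steps can be sketched with the standard library alone. This is a minimal illustration, not how Presidio works internally: the `PII_PATTERNS` table and the `identify` and `replace` helpers are hypothetical names introduced here, and the two regexes cover only the formats in the sample sentence.

```python
import re

# Hypothetical patterns for illustration only; real PII detection (for example
# with Presidio) combines NER models, checksums, and context words.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def identify(text: str) -> list[tuple[int, int, str]]:
    """Step 1: return (start, end, entity_type) for every PII span found."""
    spans = []
    for entity_type, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), entity_type))
    return sorted(spans)


def replace(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Step 2: swap each span for a numbered placeholder such as <EMAIL_1>.

    Assumes the spans are sorted and do not overlap.
    """
    counters: dict[str, int] = {}
    pieces, cursor = [], 0
    for start, end, entity_type in spans:
        counters[entity_type] = counters.get(entity_type, 0) + 1
        pieces.append(text[cursor:start])
        pieces.append(f"<{entity_type}_{counters[entity_type]}>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


sample = "Call me at 313-666-7440 or email me at real.slim.shady@gmail.com"
anonymized = replace(sample, identify(sample))
print(anonymized)
```

The numbered placeholders keep the text readable for a language model while still letting you refer to "the first phone number", which is the property that encryption would lose.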

We use *Microsoft Presidio* for anonymization because of the wide range of functionality it provides. In LangChain it is wrapped by the `PresidioAnonymizer` class.

## Quickstart

The cells below show how to use anonymization in LangChain.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# Install the necessary packages
# ! pip install langchain langchain-experimental openai presidio-analyzer presidio-anonymizer spacy Faker python-dotenv
# ! python -m spacy download en_core_web_lg

# Set env var OPENAI_API_KEY or load from a .env file:
import dotenv

dotenv.load_dotenv()

True

In [3]:
from langchain_experimental.data_anonymizer import PresidioAnonymizer

In [4]:
# Restrict anonymization to PERSON entities; the phone number and email stay unchanged
anonymizer = PresidioAnonymizer(analyzed_fields=["PERSON"])

In [5]:
anonymizer.anonymize(
    "My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com"
)

'My name is Brenda Nelson, call me at 313-666-7440 or email me at real.slim.shady@gmail.com'

In [6]:
# Add PHONE_NUMBER so that phone numbers are replaced as well
anonymizer = PresidioAnonymizer(analyzed_fields=["PERSON", "PHONE_NUMBER"])
anonymizer.anonymize(
    "My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com"
)

'My name is Terry Lynch, call me at (489)935-0644x7328 or email me at real.slim.shady@gmail.com'

In [7]:
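# Without `analyzed_fields`, the default entity types are used;
# here the name, phone number, and email are all replaced
anonymizer = PresidioAnonymizer()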
anonymizer.anonymize(
    "My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com"
)

'My name is Joseph Vang, call me at (986)925-6310 or email me at harmonashley@example.net'

In [8]:
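# A 9-digit Polish phone number is not handled well by the default recognizers:
# the replacement in the output does not look like a phone number
anonymizer = PresidioAnonymizer()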
anonymizer.anonymize("My polish phone number is 666555444")

'My polish phone number is LCBN14037514713276'

In [9]:
# Define the regex pattern in a Presidio `Pattern` object:
from presidio_analyzer import Pattern, PatternRecognizer


polish_phone_numbers_pattern = Pattern(
    name="polish_phone_numbers_pattern",
    # Raw string, so that `\w`, `\d`, and `\(` reach the regex engine unchanged
    regex=r"(?<!\w)(\(?(\+|00)?48\)?)?[ -]?\d{3}[ -]?\d{3}[ -]?\d{3}(?!\w)",
    score=1,
)

# Define the recognizer with one or more patterns
polish_phone_numbers_recognizer = PatternRecognizer(
    supported_entity="POLISH_PHONE_NUMBER", patterns=[polish_phone_numbers_pattern]
)

In [10]:
# Register the custom recognizer with the existing anonymizer
anonymizer.add_recognizer(polish_phone_numbers_recognizer)

In [11]:
print(anonymizer.anonymize("My polish phone number is 666555444"))
print(anonymizer.anonymize("My polish phone number is 666 555 444"))
print(anonymizer.anonymize("My polish phone number is +48 666 555 444"))

My polish phone number is <POLISH_PHONE_NUMBER>
My polish phone number is <POLISH_PHONE_NUMBER>
My polish phone number is <POLISH_PHONE_NUMBER>
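
To see which substrings the pattern captures without going through Presidio, the same expression can be tried with Python's `re` module. This is a standalone sketch: `POLISH_PHONE_RE` and the extra ten-digit sample are introduced here for illustration.

```python
import re

# Same expression as in the Presidio `Pattern`, written as a raw string.
POLISH_PHONE_RE = re.compile(
    r"(?<!\w)(\(?(\+|00)?48\)?)?[ -]?\d{3}[ -]?\d{3}[ -]?\d{3}(?!\w)"
)

samples = [
    "My polish phone number is 666555444",
    "My polish phone number is 666 555 444",
    "My polish phone number is +48 666 555 444",
    "Order 1234567890 shipped",  # ten digits in a row: should not match
]

for sample in samples:
    match = POLISH_PHONE_RE.search(sample)
    print(repr(sample), "->", match.group(0) if match else None)
```

The lookarounds `(?<!\w)` and `(?!\w)` are what stop the pattern from matching nine digits inside a longer number.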


In [12]:
from faker import Faker

fake = Faker(locale="pl_PL")


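def fake_polish_phone_number(_=None):
    # The argument (the detected text, when called as an operator) is ignored;
    # a fresh fake number is generated on every call
    return fake.phone_number()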


fake_polish_phone_number()

'+48 32 615 90 45'

In [13]:
from presidio_anonymizer.entities import OperatorConfig

# Map the custom entity type to a "custom" operator that calls our function
new_operators = {
    "POLISH_PHONE_NUMBER": OperatorConfig(
        "custom", {"lambda": fake_polish_phone_number}
    )
}

In [14]:
# Register the operator so POLISH_PHONE_NUMBER entities get a fake number
anonymizer.add_operators(new_operators)

In [15]:
anonymizer.anonymize("My polish phone number is 666555444")

'My polish phone number is 882 897 705'
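
The idea behind the custom operator, a callable that produces the replacement for each detected value, can be mimicked with `re.sub`, which also accepts a function. This is a rough analogue under stated assumptions, not Presidio's mechanism: `fake_number` is a hypothetical helper, it receives a `re.Match` rather than the raw text, and it generates plain random digits instead of using Faker.

```python
import random
import re

POLISH_PHONE_RE = re.compile(
    r"(?<!\w)(\(?(\+|00)?48\)?)?[ -]?\d{3}[ -]?\d{3}[ -]?\d{3}(?!\w)"
)

rng = random.Random(0)  # seeded so repeated runs produce the same fake numbers


def fake_number(_match: re.Match) -> str:
    """Ignore the matched text and return a random 9-digit number in 3-3-3 form."""
    digits = "".join(str(rng.randint(0, 9)) for _ in range(9))
    return f"{digits[:3]} {digits[3:6]} {digits[6:]}"


original = "My polish phone number is 666555444"
replaced = POLISH_PHONE_RE.sub(fake_number, original)
print(replaced)
```

Because the replacement keeps the shape of a phone number, downstream text still reads naturally, which is the reason for preferring a fake value over a `<POLISH_PHONE_NUMBER>` tag.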

In [16]:
from langchain.chains.transform import TransformChain

anonymizer = PresidioAnonymizer()


def anonymize_func(inputs: dict) -> dict:
    text = inputs["text"]
    return {"output_text": anonymizer.anonymize(text)}


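# Wrap the function in a chain so it can be composed with other chains
anonymize_chain = TransformChain(
    input_variables=["text"], output_variables=["output_text"], transform=anonymize_func
)

# The URL in the input is replaced with a fake one in `output_text`
anonymize_chain("You can find our super secret data at https://supersecretdata.com")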

{'text': 'You can find our super secret data at https://supersecretdata.com',
 'output_text': 'You can find our super secret data at http://grant.com/'}

In [17]:
from langchain.chains import SimpleSequentialChain
from langchain.prompts.prompt import PromptTemplate
from langchain.chains.llm import LLMChain
from langchain.llms.openai import OpenAI

template = """According to this text, where can you find our super secret data?

{output_text}

Answer:"""
prompt = PromptTemplate(input_variables=["output_text"], template=template)
llm_chain = LLMChain(llm=OpenAI(), prompt=prompt)


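# The anonymized text from the first chain becomes the prompt input of the second,
# so the model only ever sees the fake URL
sequential_chain = SimpleSequentialChain(chains=[anonymize_chain, llm_chain])
sequential_chain("You can find our super secret data at https://supersecretdata.com")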

{'input': 'You can find our super secret data at https://supersecretdata.com',
 'output': ' https://www.brown-hunter.info/'}
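
The data flow of the sequential chain can be illustrated without LangChain or an API key. In this sketch every name is hypothetical: `anonymize` is a stand-in that only masks URLs, `stub_model` replaces the LLM call and simply records the prompt it receives, and `run_pipeline` plays the role of the sequential chain.

```python
import re
from typing import Callable

URL_RE = re.compile(r"https?://\S+")

TEMPLATE = (
    "According to this text, where can you find our super secret data?\n\n"
    "{output_text}\n\nAnswer:"
)

seen_prompts: list[str] = []  # everything the stub "model" has been shown


def anonymize(text: str) -> str:
    """Stand-in for the anonymizer: masks URLs only."""
    return URL_RE.sub("<URL>", text)


def build_prompt(text: str) -> str:
    """Fill the prompt template with the (already anonymized) text."""
    return TEMPLATE.format(output_text=text)


def stub_model(prompt: str) -> str:
    """Stand-in for an LLM call: records the prompt and returns a fixed reply."""
    seen_prompts.append(prompt)
    return "stub answer"


def run_pipeline(text: str, steps: list[Callable[[str], str]]) -> str:
    """Feed the output of each step into the next, like a sequential chain."""
    for step in steps:
        text = step(text)
    return text


secret = "You can find our super secret data at https://supersecretdata.com"
answer = run_pipeline(secret, [anonymize, build_prompt, stub_model])
print(seen_prompts[0])
```

Putting the anonymization step first guarantees that nothing later in the pipeline, including the model, receives the original value.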

## Future work