# Data anonymization with Microsoft Presidio

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/docs/guides/privacy/presidio_data_anonymization/index.ipynb)

>[Presidio](https://microsoft.github.io/presidio/) (from the Latin praesidium, 'protection, garrison') helps ensure that sensitive data is properly managed and governed. It provides fast identification and anonymization modules for private entities in text and images, such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers, financial data, and more.

## Use case

Data anonymization is very important before passing information to a language model such as GPT-4 because it helps protect privacy and maintain confidentiality. If data is not anonymized, sensitive information such as names, addresses, contact numbers, or other identifiers linked to specific individuals could be learned and misused. By obscuring or removing this personally identifiable information (PII), the data can be used freely without compromising individuals' privacy rights or violating data protection laws and regulations.

## Overview

Anonymization consists of two steps:

1. **Identification:** Identify all data fields that contain personally identifiable information (PII).
2. **Replacement:** Replace all PII with pseudo values or codes that reveal no personal information about the individual but can still be used for reference. We do not use regular encryption, because the language model would not be able to understand the meaning or context of the encrypted data.
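
The two steps can be illustrated with a minimal, standard-library sketch. It is not how Presidio works internally; the regular expressions and the `identify` / `replace` helper names are assumptions chosen purely for illustration.

```python
import re

# Hypothetical, deliberately simple patterns; real detectors are far more robust.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def identify(text):
    """Step 1: return sorted (start, end, entity_type) spans for every match."""
    spans = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), entity_type))
    return sorted(spans)


def replace(text, spans):
    """Step 2: substitute each span with a readable placeholder."""
    # Work right to left so earlier offsets stay valid after each substitution.
    for start, end, entity_type in reversed(spans):
        text = text[:start] + f"<{entity_type}>" + text[end:]
    return text


sample = "Call me at 313-666-7440 or email me at real.slim.shady@gmail.com"
anonymized = replace(sample, identify(sample))
print(anonymized)
```

The placeholders stay readable, so a language model can still tell that a phone number or an email address was present, which is the reason for preferring replacement over encryption.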

We use *Microsoft Presidio* together with the *Faker* framework for anonymization because of the wide range of functionality they provide. The full implementation is available in `PresidioAnonymizer`.

## Quickstart

Below you will find a use case showing how to leverage anonymization in LangChain.


In [2]:
%pip install --upgrade --quiet  langchain langchain-openai langchain-experimental presidio-analyzer presidio-anonymizer spacy Faker

Note: you may need to restart the kernel to use updated packages.


In [3]:
# Download model
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')


Let's see how it works with an example sentence:

In [4]:
from langchain_experimental.data_anonymizer import PresidioAnonymizer

anonymizer = PresidioAnonymizer()

anonymizer.anonymize(
    "Nombre es Pepito Pérez, llámame al 313-666-7440 o envíame un email a real.pepito-perez@gmail.com"
)

'Gene Fernandez, llámame al 001-840-921-5837x93999 o envíame un email a debbienelson@example.net'

### Using with LangChain Expression Language

With LCEL, we can chain the anonymizer together with the rest of the application.

In [5]:
import os

from dotenv import load_dotenv

# Load environment variables from a .env file
load_dotenv()
# Read the OpenAI API key from the environment
api_key = os.getenv("OPENAI_API_KEY")

In [6]:
text = """Pepito Pérez perdió recientemente su cartera.
Dentro hay algo de efectivo y su tarjeta de crédito con el número 4916 0387 9536 0861.
Si lo encuentra, llame al 313-666-7440 o escriba un correo electrónico aquí: real.pepito.perez@gmail.com."""

In [7]:
from langchain.prompts.prompt import PromptTemplate
from langchain_openai import ChatOpenAI

anonymizer = PresidioAnonymizer()

template = """Reescribe esto en un tono oficial y profesional, en un email corto:

{anonymized_text}"""
prompt = PromptTemplate.from_template(template)
llm = ChatOpenAI(api_key=api_key, temperature=0)

chain = {"anonymized_text": anonymizer.anonymize} | prompt | llm
response = chain.invoke(text)
print(response.content)

Estimado señor Davenport,

Espero que este mensaje le encuentre bien. Me dirijo a usted en relación a la cartera de otoño Brewer, la cual contiene una suma de dinero y los documentos de identificación de Robert Hansen y Annette Thomas, cuyo número de tarjeta es 6519465005594771.

Le solicito amablemente que se comunique conmigo a través del número de teléfono +1-591-204-5684, extensión 3538, o bien, puede enviarme un correo electrónico a la siguiente dirección: pittsjeffrey@example.net.

Agradezco de antemano su pronta atención a este asunto.

Atentamente,

Jeffrey Pitts
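
Conceptually, the chain runs three stages in order: anonymize the input, fill the prompt template, and call the model. The plain-Python sketch below mimics that order with stand-ins. `toy_anonymize`, `echo_model`, and `TEMPLATE` are hypothetical placeholders, not LangChain or Presidio APIs; they only show that the model receives the anonymized text rather than the original.

```python
import re

TEMPLATE = "Rewrite this text into an official, short email:\n\n{anonymized_text}"


def toy_anonymize(text):
    """Hypothetical stand-in for `anonymizer.anonymize`: masks US-style phone numbers."""
    return re.sub(r"\b\d{3}-\d{3}-\d{4}\b", "<PHONE_NUMBER>", text)


def echo_model(prompt):
    """Hypothetical stand-in for the LLM: returns the prompt it was given."""
    return prompt


def run_chain(text):
    # Same order as `{"anonymized_text": anonymizer.anonymize} | prompt | llm`.
    inputs = {"anonymized_text": toy_anonymize(text)}
    prompt = TEMPLATE.format(**inputs)
    return echo_model(prompt)


seen_by_model = run_chain("If you find the wallet, call 313-666-7440.")
print(seen_by_model)
```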


## Customization
We can specify `analyzed_fields` to only anonymize particular types of data.

In [8]:
anonymizer = PresidioAnonymizer(analyzed_fields=["PERSON"])

anonymizer.anonymize(
    "My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com"
)

'My name is Christina Adams, call me at 313-666-7440 or email me at real.slim.shady@gmail.com'

As can be observed, the name was correctly identified and replaced with another. The `analyzed_fields` parameter determines which entity types are detected and substituted. We can add *PHONE_NUMBER* to the list:

In [9]:
anonymizer = PresidioAnonymizer(analyzed_fields=["PERSON", "PHONE_NUMBER"])
anonymizer.anonymize(
    "My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com"
)

'My name is Robert Bell, call me at 001-203-334-8093 or email me at real.slim.shady@gmail.com'

If no `analyzed_fields` are specified, the anonymizer detects all supported formats by default. Below is the full list of them:

`['PERSON', 'EMAIL_ADDRESS', 'PHONE_NUMBER', 'IBAN_CODE', 'CREDIT_CARD', 'CRYPTO', 'IP_ADDRESS', 'LOCATION', 'DATE_TIME', 'NRP', 'MEDICAL_LICENSE', 'URL', 'US_BANK_NUMBER', 'US_DRIVER_LICENSE', 'US_ITIN', 'US_PASSPORT', 'US_SSN']`
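
The effect of restricting the field list can be sketched without Presidio. In the toy function below, the `analyzed_fields` argument only mimics the behavior described here; `DETECTORS` and `toy_anonymize` are hypothetical names, and the regular expressions are simplified assumptions.

```python
import re

# Hypothetical detectors keyed by entity type; not Presidio's recognizers.
DETECTORS = {
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}


def toy_anonymize(text, analyzed_fields=None):
    """Mask only the requested entity types; mask every known type when None."""
    fields = list(DETECTORS) if analyzed_fields is None else analyzed_fields
    for field in fields:
        text = DETECTORS[field].sub(f"<{field}>", text)
    return text


sample = "Call me at 313-666-7440 or email me at real.slim.shady@gmail.com"
only_phone = toy_anonymize(sample, analyzed_fields=["PHONE_NUMBER"])
everything = toy_anonymize(sample)
print(only_phone)
print(everything)
```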

**Disclaimer:** We suggest carefully defining the private data to be detected - Presidio doesn't work perfectly and it sometimes makes mistakes, so it's better to have more control over the data.

In [10]:
anonymizer = PresidioAnonymizer()
anonymizer.anonymize(
    "My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com"
)

'My name is Willie Smith, call me at 3997682169 or email me at guerralauren@example.com'

The above list of detected fields may not be sufficient. For example, the built-in *PHONE_NUMBER* field does not support Polish phone numbers and confuses them with another field:

In [11]:
anonymizer = PresidioAnonymizer()
anonymizer.anonymize("My polish phone number is 666555444")

'My polish phone number is 87942VK'

You can then write your own recognizers and add them to the pool of those present. How exactly to create recognizers is described in the [Presidio documentation](https://microsoft.github.io/presidio/samples/python/customizing_presidio_analyzer/).

In [12]:
# Define the regex pattern in a Presidio `Pattern` object:
from presidio_analyzer import Pattern, PatternRecognizer

polish_phone_numbers_pattern = Pattern(
    name="polish_phone_numbers_pattern",
    regex=r"(?<!\w)(\(?(\+|00)?48\)?)?[ -]?\d{3}[ -]?\d{3}[ -]?\d{3}(?!\w)",
    score=1,
)

# Define the recognizer with one or more patterns
polish_phone_numbers_recognizer = PatternRecognizer(
    supported_entity="POLISH_PHONE_NUMBER", patterns=[polish_phone_numbers_pattern]
)
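
Before wiring a pattern into a recognizer, it can help to check the regular expression on its own. The sketch below uses only the standard `re` module and the same expression as above; the sample strings and the variable names are illustrative.

```python
import re

# Same expression as in the `Pattern` above, written as a raw string.
POLISH_PHONE_REGEX = re.compile(
    r"(?<!\w)(\(?(\+|00)?48\)?)?[ -]?\d{3}[ -]?\d{3}[ -]?\d{3}(?!\w)"
)

samples = [
    "My polish phone number is 666555444",
    "My polish phone number is 666 555 444",
    "My polish phone number is +48 666 555 444",
]

matches = []
for sample in samples:
    match = POLISH_PHONE_REGEX.search(sample)
    matches.append(match.group(0) if match else None)
print(matches)

# A 10-digit run is rejected because of the `(?!\w)` lookahead.
too_long = POLISH_PHONE_REGEX.search("id 6665554445")
print(too_long)
```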

Now we can add the recognizer by calling the `add_recognizer` method on the anonymizer:

In [13]:
anonymizer.add_recognizer(polish_phone_numbers_recognizer)

And voilà! With the added pattern-based recognizer, the anonymizer now handles Polish phone numbers.

In [14]:
print(anonymizer.anonymize("My polish phone number is 666555444"))
print(anonymizer.anonymize("My polish phone number is 666 555 444"))
print(anonymizer.anonymize("My polish phone number is +48 666 555 444"))

My polish phone number is <POLISH_PHONE_NUMBER>
My polish phone number is <POLISH_PHONE_NUMBER>
My polish phone number is <POLISH_PHONE_NUMBER>


The problem is that, even though we now recognize Polish phone numbers, we have no method (operator) that specifies how to substitute the field. As a result, the output contains only the placeholder string `<POLISH_PHONE_NUMBER>`. We need to create a method to replace it correctly:

In [15]:
from faker import Faker

fake = Faker(locale="pl_PL")


def fake_polish_phone_number(_=None):
    return fake.phone_number()


fake_polish_phone_number()

'+48 662 922 337'

We used Faker to create pseudo data. Now we can create an operator and add it to the anonymizer. For complete information about operators and their creation, see the Presidio documentation for [simple](https://microsoft.github.io/presidio/tutorial/10_simple_anonymization/) and [custom](https://microsoft.github.io/presidio/tutorial/11_custom_anonymization/) anonymization.

In [16]:
from presidio_anonymizer.entities import OperatorConfig

new_operators = {
    "POLISH_PHONE_NUMBER": OperatorConfig(
        "custom", {"lambda": fake_polish_phone_number}
    )
}

In [17]:
anonymizer.add_operators(new_operators)

In [18]:
anonymizer.anonymize("My polish phone number is 666555444")

'My polish phone number is +48 510 831 870'
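
The operator used above is a callable that takes the detected text and returns its replacement. The sketch below reproduces that idea with the standard library only. `toy_polish_phone_number` is a hypothetical stand-in for the Faker-based function, and `re.sub` stands in for the anonymizer engine; neither reflects Presidio's internals.

```python
import random
import re

POLISH_PHONE_REGEX = re.compile(
    r"(?<!\w)(\(?(\+|00)?48\)?)?[ -]?\d{3}[ -]?\d{3}[ -]?\d{3}(?!\w)"
)
rng = random.Random(0)  # Seeded only to make the sketch reproducible.


def toy_polish_phone_number(_original=None):
    """Hypothetical stand-in for Faker: random digits in '+48 XXX XXX XXX' form."""
    groups = [f"{rng.randrange(1000):03d}" for _ in range(3)]
    return "+48 " + " ".join(groups)


def toy_anonymize(text):
    # The operator is called once per detected entity with the matched text.
    return POLISH_PHONE_REGEX.sub(
        lambda match: toy_polish_phone_number(match.group(0)), text
    )


result = toy_anonymize("My polish phone number is 666555444")
print(result)
```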

## Important considerations

### Anonymizer detection rates

**The level of anonymization and the precision of detection are only as good as the recognizers implemented.**

Texts from different sources and in different languages have varying characteristics, so it is necessary to test the detection precision and iteratively add recognizers and operators to achieve better and better results.

Microsoft Presidio gives a lot of freedom to refine anonymization. The library's author has provided his [recommendations and a step-by-step guide for improving detection rates](https://github.com/microsoft/presidio/discussions/767#discussion-3567223).
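
One way to make that testing loop concrete is to keep a small hand-labeled sample and measure precision and recall after every change to the recognizers. The sketch below assumes a hypothetical `detect` function that returns the set of detected strings; the labeled examples are invented for illustration and nothing here comes from Presidio.

```python
import re

PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def detect(text):
    """Hypothetical detector under test: returns the set of detected PII strings."""
    return set(PHONE.findall(text))


# Hand-labeled examples: (text, set of PII strings that should be found).
LABELED = [
    ("Call 313-666-7440 today", {"313-666-7440"}),
    ("My polish phone number is 666555444", {"666555444"}),
    ("Order 555-123-4567 shipped", set()),  # An order ID, not a phone number.
]


def evaluate(samples):
    """Count true positives, false positives, and false negatives over all samples."""
    tp = fp = fn = 0
    for text, expected in samples:
        found = detect(text)
        tp += len(found & expected)
        fp += len(found - expected)
        fn += len(expected - found)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


precision, recall = evaluate(LABELED)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A missed Polish number lowers recall and a masked order ID lowers precision, so both kinds of error stay visible while recognizers are added or tuned.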

### Instance anonymization

`PresidioAnonymizer` has no built-in memory. Therefore, the same entity appearing in two consecutive texts is replaced with two different fake values:

In [19]:
print(anonymizer.anonymize("My name is John Doe. Hi John Doe!"))
print(anonymizer.anonymize("My name is John Doe. Hi John Doe!"))

My name is Deanna Williams. Hi Deanna Williams!
My name is Ashley Baker. Hi Ashley Baker!


To preserve previous anonymization results, use `PresidioReversibleAnonymizer`, which has built-in memory:

In [20]:
from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer

anonymizer_with_memory = PresidioReversibleAnonymizer()

print(anonymizer_with_memory.anonymize("My name is John Doe. Hi John Doe!"))
print(anonymizer_with_memory.anonymize("My name is John Doe. Hi John Doe!"))

My name is James Brown. Hi James Brown!
My name is James Brown. Hi James Brown!
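
The idea behind that memory can be sketched as a mapping from each original value to the fake value chosen the first time it was seen. This is a conceptual illustration, not the actual implementation of `PresidioReversibleAnonymizer`; the class `ToyMemoryAnonymizer`, its fixed list of known names, and its list of fake names are all hypothetical.

```python
import itertools


class ToyMemoryAnonymizer:
    """Hypothetical sketch: reuse the same fake value for a repeated original."""

    def __init__(self, known_names, fake_names):
        self.known_names = known_names
        self._fake_names = itertools.cycle(fake_names)
        self.mapping = {}  # original value -> fake value

    def anonymize(self, text):
        for name in self.known_names:
            if name in text:
                # Pick a fake value only the first time this name is seen.
                if name not in self.mapping:
                    self.mapping[name] = next(self._fake_names)
                text = text.replace(name, self.mapping[name])
        return text


toy = ToyMemoryAnonymizer(["John Doe"], ["James Brown", "Ashley Baker"])
first = toy.anonymize("My name is John Doe. Hi John Doe!")
second = toy.anonymize("My name is John Doe. Hi John Doe!")
print(first)
print(second)
```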


You can learn more about `PresidioReversibleAnonymizer` in the next section.