From 274c3dc3a82d5248859e09bccec1acef682335a4 Mon Sep 17 00:00:00 2001 From: maks-operlejn-ds <142261444+maks-operlejn-ds@users.noreply.github.com> Date: Thu, 7 Sep 2023 23:42:24 +0200 Subject: [PATCH 1/3] Multilingual anonymization (#10327) ### Description Add multiple language support to Anonymizer PII detection in Microsoft Presidio relies on several components - in addition to the usual pattern matching (e.g. using regex), the analyser uses a model for Named Entity Recognition (NER) to extract entities such as: - `PERSON` - `LOCATION` - `DATE_TIME` - `NRP` - `ORGANIZATION` [[Source]](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/predefined_recognizers/spacy_recognizer.py) To handle NER in specific languages, we utilize unique models from the `spaCy` library, recognized for its extensive selection covering multiple languages and sizes. However, it's not restrictive, allowing for integration of alternative frameworks such as [Stanza](https://microsoft.github.io/presidio/analyzer/nlp_engines/spacy_stanza/) or [transformers](https://microsoft.github.io/presidio/analyzer/nlp_engines/transformers/) when necessary. ### Future works - **automatic language detection** - instead of passing the language as a parameter in `anonymizer.anonymize`, we could detect the language/s beforehand and then use the corresponding NER model. 
We have discussed this internally and @mateusz-wosinski-ds will look into a standalone language detection tool/chain for LangChain :smile: ### Twitter handle @deepsense_ai / @MaksOpp ### Tag maintainer @baskaryan @hwchase17 @hinthornw --- ...b => 01_presidio_data_anonymization.ipynb} | 4 +- ...02_presidio_reversible_anonymization.ipynb | 461 ++++++++++++++++ ...residio_multi_language_anonymization.ipynb | 520 ++++++++++++++++++ .../presidio_reversible_anonymization.ipynb | 461 ---------------- .../data_anonymizer/base.py | 7 +- .../data_anonymizer/faker_presidio_mapping.py | 4 +- .../data_anonymizer/presidio.py | 71 ++- 7 files changed, 1053 insertions(+), 475 deletions(-) rename docs/extras/guides/privacy/{presidio_data_anonymization.ipynb => 01_presidio_data_anonymization.ipynb} (97%) create mode 100644 docs/extras/guides/privacy/02_presidio_reversible_anonymization.ipynb create mode 100644 docs/extras/guides/privacy/03_presidio_multi_language_anonymization.ipynb delete mode 100644 docs/extras/guides/privacy/presidio_reversible_anonymization.ipynb diff --git a/docs/extras/guides/privacy/presidio_data_anonymization.ipynb b/docs/extras/guides/privacy/01_presidio_data_anonymization.ipynb similarity index 97% rename from docs/extras/guides/privacy/presidio_data_anonymization.ipynb rename to docs/extras/guides/privacy/01_presidio_data_anonymization.ipynb index 4b4b718e29b9..c06157c1187d 100644 --- a/docs/extras/guides/privacy/presidio_data_anonymization.ipynb +++ b/docs/extras/guides/privacy/01_presidio_data_anonymization.ipynb @@ -6,7 +6,7 @@ "source": [ "# Data anonymization with Microsoft Presidio\n", "\n", - "[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/guides/privacy/presidio_data_anonymization.ipynb)\n", + "[![Open In 
Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/guides/privacy/01_presidio_data_anonymization.ipynb)\n", "\n", "## Use case\n", "\n", @@ -439,8 +439,6 @@ "metadata": {}, "source": [ "## Future works\n", - "\n", - "- **deanonymization** - add the ability to reverse anonymization. For example, the workflow could look like this: `anonymize -> LLMChain -> deanonymize`. By doing this, we will retain anonymity in requests to, for example, OpenAI, and then be able restore the original data.\n", "- **instance anonymization** - at this point, each occurrence of PII is treated as a separate entity and separately anonymized. Therefore, two occurrences of the name John Doe in the text will be changed to two different names. It is therefore worth introducing support for full instance detection, so that repeated occurrences are treated as a single object." ] } diff --git a/docs/extras/guides/privacy/02_presidio_reversible_anonymization.ipynb b/docs/extras/guides/privacy/02_presidio_reversible_anonymization.ipynb new file mode 100644 index 000000000000..4c75523969a6 --- /dev/null +++ b/docs/extras/guides/privacy/02_presidio_reversible_anonymization.ipynb @@ -0,0 +1,461 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Reversible data anonymization with Microsoft Presidio\n", + "\n", + "[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/guides/privacy/02_presidio_reversible_anonymization.ipynb)\n", + "\n", + "\n", + "## Use case\n", + "\n", + "We have already written about the importance of anonymizing sensitive data in the previous section. **Reversible Anonymization** is an equally essential technology while sharing information with language models, as it balances data protection with data usability. 
This technique involves masking sensitive personally identifiable information (PII), yet it can be reversed and the original data can be restored when authorized users need it. Its main advantage lies in the fact that while it conceals individual identities to prevent misuse, it also allows the concealed data to be accurately unmasked should it be necessary for legal or compliance purposes. \n", + "\n", + "## Overview\n", + "\n", + "We implemented the `PresidioReversibleAnonymizer`, which consists of two parts:\n", + "\n", + "1. anonymization - it works the same way as `PresidioAnonymizer`, plus the object itself stores a mapping of made-up values to original ones, for example:\n", + "```\n", + " {\n", + " \"PERSON\": {\n", + " \"\": \"\",\n", + " \"John Doe\": \"Slim Shady\"\n", + " },\n", + " \"PHONE_NUMBER\": {\n", + " \"111-111-1111\": \"555-555-5555\"\n", + " }\n", + " ...\n", + " }\n", + "```\n", + "\n", + "2. deanonymization - using the mapping described above, it matches fake data with original data and then substitutes it.\n", + "\n", + "Between anonymization and deanonymization the user can perform different operations, for example, passing the output to an LLM.\n", + "\n", + "## Quickstart\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# Install necessary packages\n", + "# ! pip install langchain langchain-experimental openai presidio-analyzer presidio-anonymizer spacy Faker\n", + "# ! python -m spacy download en_core_web_lg" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`PresidioReversibleAnonymizer` is not significantly different from its predecessor (`PresidioAnonymizer`) in terms of anonymization:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'My name is Maria Lynch, call me at 7344131647 or email me at jamesmichael@example.com. 
By the way, my card number is: 4838637940262'" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer\n", + "\n", + "anonymizer = PresidioReversibleAnonymizer(\n", + " analyzed_fields=[\"PERSON\", \"PHONE_NUMBER\", \"EMAIL_ADDRESS\", \"CREDIT_CARD\"],\n", + " # Faker seed is used here to make sure the same fake data is generated for the test purposes\n", + " # In production, it is recommended to remove the faker_seed parameter (it will default to None)\n", + " faker_seed=42,\n", + ")\n", + "\n", + "anonymizer.anonymize(\n", + " \"My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com. \"\n", + " \"By the way, my card number is: 4916 0387 9536 0861\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This is what the full string we want to deanonymize looks like:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Maria Lynch recently lost his wallet. \n", + "Inside is some cash and his credit card with the number 4838637940262. \n", + "If you would find it, please call at 7344131647 or write an email here: jamesmichael@example.com.\n", + "Maria Lynch would be very grateful!\n" + ] + } + ], + "source": [ + "# We know this data, as we set the faker_seed parameter\n", + "fake_name = \"Maria Lynch\"\n", + "fake_phone = \"7344131647\"\n", + "fake_email = \"jamesmichael@example.com\"\n", + "fake_credit_card = \"4838637940262\"\n", + "\n", + "anonymized_text = f\"\"\"{fake_name} recently lost his wallet. \n", + "Inside is some cash and his credit card with the number {fake_credit_card}. 
\n", + "If you would find it, please call at {fake_phone} or write an email here: {fake_email}.\n", + "{fake_name} would be very grateful!\"\"\"\n", + "\n", + "print(anonymized_text)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And now, using the `deanonymize` method, we can reverse the process:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Slim Shady recently lost his wallet. \n", + "Inside is some cash and his credit card with the number 4916 0387 9536 0861. \n", + "If you would find it, please call at 313-666-7440 or write an email here: real.slim.shady@gmail.com.\n", + "Slim Shady would be very grateful!\n" + ] + } + ], + "source": [ + "print(anonymizer.deanonymize(anonymized_text))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Using with LangChain Expression Language\n", + "\n", + "With LCEL we can easily chain together anonymization and deanonymization with the rest of our application. This is an example of using the anonymization mechanism with a query to LLM (without deanonymization for now):" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "text = f\"\"\"Slim Shady recently lost his wallet. \n", + "Inside is some cash and his credit card with the number 4916 0387 9536 0861. \n", + "If you would find it, please call at 313-666-7440 or write an email here: real.slim.shady@gmail.com.\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Dear Sir/Madam,\n", + "\n", + "We regret to inform you that Mr. Dana Rhodes has reported the loss of his wallet. The wallet contains a sum of cash and his credit card, bearing the number 4397528473885757. 
\n", + "\n", + "If you happen to come across the aforementioned wallet, we kindly request that you contact us immediately at 258-481-7074x714 or via email at laurengoodman@example.com.\n", + "\n", + "Your prompt assistance in this matter would be greatly appreciated.\n", + "\n", + "Yours faithfully,\n", + "\n", + "[Your Name]\n" + ] + } + ], + "source": [ + "from langchain.prompts.prompt import PromptTemplate\n", + "from langchain.chat_models import ChatOpenAI\n", + "\n", + "anonymizer = PresidioReversibleAnonymizer()\n", + "\n", + "template = \"\"\"Rewrite this text into an official, short email:\n", + "\n", + "{anonymized_text}\"\"\"\n", + "prompt = PromptTemplate.from_template(template)\n", + "llm = ChatOpenAI(temperature=0)\n", + "\n", + "chain = {\"anonymized_text\": anonymizer.anonymize} | prompt | llm\n", + "response = chain.invoke(text)\n", + "print(response.content)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let's add **deanonymization step** to our sequence:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Dear Sir/Madam,\n", + "\n", + "We regret to inform you that Mr. Slim Shady has recently misplaced his wallet. The wallet contains a sum of cash and his credit card, bearing the number 4916 0387 9536 0861. 
\n", + "\n", + "If by any chance you come across the lost wallet, kindly contact us immediately at 313-666-7440 or send an email to real.slim.shady@gmail.com.\n", + "\n", + "Your prompt assistance in this matter would be greatly appreciated.\n", + "\n", + "Yours faithfully,\n", + "\n", + "[Your Name]\n" + ] + } + ], + "source": [ + "chain = chain | (lambda ai_message: anonymizer.deanonymize(ai_message.content))\n", + "response = chain.invoke(text)\n", + "print(response)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Only anonymized data was given to the model, so the original data was protected from being leaked to the outside world. Then the model's response was processed, and the fake values were replaced with the real ones." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Extra knowledge" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`PresidioReversibleAnonymizer` stores the mapping of the fake values to the original values in the `deanonymizer_mapping` attribute, where each key is the fake PII and each value is the original one: " + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'PERSON': {'Maria Lynch': 'Slim Shady'},\n", + " 'PHONE_NUMBER': {'7344131647': '313-666-7440'},\n", + " 'EMAIL_ADDRESS': {'jamesmichael@example.com': 'real.slim.shady@gmail.com'},\n", + " 'CREDIT_CARD': {'4838637940262': '4916 0387 9536 0861'}}" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer\n", + "\n", + "anonymizer = PresidioReversibleAnonymizer(\n", + " analyzed_fields=[\"PERSON\", \"PHONE_NUMBER\", \"EMAIL_ADDRESS\", \"CREDIT_CARD\"],\n", + " # Faker seed is used here to make sure the same fake data is generated for the test purposes\n", + " # In production, it is recommended to remove 
the faker_seed parameter (it will default to None)\n", + " faker_seed=42,\n", + ")\n", + "\n", + "anonymizer.anonymize(\n", + " \"My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com. \"\n", + " \"By the way, my card number is: 4916 0387 9536 0861\"\n", + ")\n", + "\n", + "anonymizer.deanonymizer_mapping" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Anonymizing more texts will result in new mapping entries:" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Do you have his VISA card number? Yep, it's 3537672423884966. I'm William Bowman by the way.\n" + ] + }, + { + "data": { + "text/plain": [ + "{'PERSON': {'Maria Lynch': 'Slim Shady', 'William Bowman': 'John Doe'},\n", + " 'PHONE_NUMBER': {'7344131647': '313-666-7440'},\n", + " 'EMAIL_ADDRESS': {'jamesmichael@example.com': 'real.slim.shady@gmail.com'},\n", + " 'CREDIT_CARD': {'4838637940262': '4916 0387 9536 0861',\n", + " '3537672423884966': '4001 9192 5753 7193'}}" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "print(\n", + " anonymizer.anonymize(\n", + " \"Do you have his VISA card number? Yep, it's 4001 9192 5753 7193. 
I'm John Doe by the way.\"\n", + " )\n", + ")\n", + "\n", + "anonymizer.deanonymizer_mapping" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can save the mapping itself to a file for future use: " + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "# We can save the deanonymizer mapping as a JSON or YAML file\n", + "\n", + "anonymizer.save_deanonymizer_mapping(\"deanonymizer_mapping.json\")\n", + "# anonymizer.save_deanonymizer_mapping(\"deanonymizer_mapping.yaml\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And then, load it in another `PresidioReversibleAnonymizer` instance:" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{}" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "anonymizer = PresidioReversibleAnonymizer()\n", + "\n", + "anonymizer.deanonymizer_mapping" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'PERSON': {'Maria Lynch': 'Slim Shady', 'William Bowman': 'John Doe'},\n", + " 'PHONE_NUMBER': {'7344131647': '313-666-7440'},\n", + " 'EMAIL_ADDRESS': {'jamesmichael@example.com': 'real.slim.shady@gmail.com'},\n", + " 'CREDIT_CARD': {'4838637940262': '4916 0387 9536 0861',\n", + " '3537672423884966': '4001 9192 5753 7193'}}" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "anonymizer.load_deanonymizer_mapping(\"deanonymizer_mapping.json\")\n", + "\n", + "anonymizer.deanonymizer_mapping" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Future works\n", + "\n", + "- **instance anonymization** - at this point, each occurrence of PII is treated as a separate entity and separately anonymized. 
Therefore, two occurrences of the name John Doe in the text will be changed to two different names. It is therefore worth introducing support for full instance detection, so that repeated occurrences are treated as a single object.\n", + "- **better matching and substitution of fake values for real ones** - currently the strategy is based on matching full strings and then substituting them. Due to the nondeterminism of language models, it may happen that the value in the answer is slightly changed (e.g. *John Doe* -> *John* or *Main St, New York* -> *New York*) and such a substitution is then no longer possible. Therefore, it is worth adjusting the matching for your needs." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file diff --git a/docs/extras/guides/privacy/03_presidio_multi_language_anonymization.ipynb b/docs/extras/guides/privacy/03_presidio_multi_language_anonymization.ipynb new file mode 100644 index 000000000000..c6c144ebae9a --- /dev/null +++ b/docs/extras/guides/privacy/03_presidio_multi_language_anonymization.ipynb @@ -0,0 +1,520 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Multi-language data anonymization with Microsoft Presidio\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/guides/privacy/03_presidio_multi_language_anonymization.ipynb)\n", + "\n", + "\n", + "## Use case\n", + "\n", + "Multi-language support in data pseudonymization is essential due to differences in 
language structures and cultural contexts. Different languages may have varying formats for personal identifiers. For example, the structure of names, locations and dates can differ greatly between languages and regions. Furthermore, non-alphanumeric characters, accents, and the direction of writing can impact pseudonymization processes. Without multi-language support, data could remain identifiable or be misinterpreted, compromising data privacy and accuracy. Hence, it enables effective and precise pseudonymization suited for global operations.\n", + "\n", + "## Overview\n", + "\n", + "PII detection in Microsoft Presidio relies on several components - in addition to the usual pattern matching (e.g. using regex), the analyser uses a model for Named Entity Recognition (NER) to extract entities such as:\n", + "- `PERSON`\n", + "- `LOCATION`\n", + "- `DATE_TIME`\n", + "- `NRP`\n", + "- `ORGANIZATION`\n", + "\n", + "[[Source]](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/predefined_recognizers/spacy_recognizer.py)\n", + "\n", + "To handle NER in specific languages, we utilize unique models from the `spaCy` library, recognized for its extensive selection covering multiple languages and sizes. However, it's not restrictive, allowing for integration of alternative frameworks such as [Stanza](https://microsoft.github.io/presidio/analyzer/nlp_engines/spacy_stanza/) or [transformers](https://microsoft.github.io/presidio/analyzer/nlp_engines/transformers/) when necessary.\n", + "\n", + "\n", + "## Quickstart\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# Install necessary packages\n", + "# ! pip install langchain langchain-experimental openai presidio-analyzer presidio-anonymizer spacy Faker\n", + "# ! 
python -m spacy download en_core_web_lg" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer\n", + "\n", + "anonymizer = PresidioReversibleAnonymizer(\n", + " analyzed_fields=[\"PERSON\"],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "By default, `PresidioAnonymizer` and `PresidioReversibleAnonymizer` use a model trained on English texts, so they handle other languages moderately well. \n", + "\n", + "For example, here the model did not detect the person:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'Me llamo Sofía'" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "anonymizer.anonymize(\"Me llamo Sofía\") # \"My name is Sofía\" in Spanish" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "They may also take words from another language as actual entities. Here, both the word *'Yo'* (*'I'* in Spanish) and *Sofía* have been classified as `PERSON`:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'Bridget Kirk soy Sally Knight'" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "anonymizer.anonymize(\"Yo soy Sofía\") # \"I am Sofía\" in Spanish" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you want to anonymise texts from other languages, you need to download other models and add them to the anonymiser configuration:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "# Download the models for the languages you want to use\n", + "# ! python -m spacy download en_core_web_md\n", + "# ! 
python -m spacy download es_core_news_md" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "nlp_config = {\n", + " \"nlp_engine_name\": \"spacy\",\n", + " \"models\": [\n", + " {\"lang_code\": \"en\", \"model_name\": \"en_core_web_md\"},\n", + " {\"lang_code\": \"es\", \"model_name\": \"es_core_news_md\"},\n", + " ],\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We have therefore added a Spanish language model. Note also that we have downloaded an alternative model for English as well - in this case we have replaced the large model `en_core_web_lg` (560MB) with its smaller version `en_core_web_md` (40MB) - the size is therefore reduced by 14 times! If you care about the speed of anonymisation, it is worth considering it.\n", + "\n", + "All models for the different languages can be found in the [spaCy documentation](https://spacy.io/usage/models).\n", + "\n", + "Now pass the configuration as the `languages_config` parameter to Anonymiser. 
As you can see, both previous examples work flawlessly:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Me llamo Michelle Smith\n", + "Yo soy Rachel Wright\n" + ] + } + ], + "source": [ + "anonymizer = PresidioReversibleAnonymizer(\n", + " analyzed_fields=[\"PERSON\"],\n", + " languages_config=nlp_config,\n", + ")\n", + "\n", + "print(\n", + " anonymizer.anonymize(\"Me llamo Sofía\", language=\"es\")\n", + ") # \"My name is Sofía\" in Spanish\n", + "print(anonymizer.anonymize(\"Yo soy Sofía\", language=\"es\")) # \"I am Sofía\" in Spanish" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "By default, the language indicated first in the configuration will be used when anonymising text (in this case English):" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "My name is Ronnie Ayala\n" + ] + } + ], + "source": [ + "print(anonymizer.anonymize(\"My name is John\"))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Advanced usage\n", + "\n", + "### Custom labels in NER model" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It may be that the spaCy model has different class names than those supported by the Microsoft Presidio by default. Take Polish, for example:" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Text: Wiktoria, Start: 12, End: 20, Label: persName\n" + ] + } + ], + "source": [ + "# ! 
python -m spacy download pl_core_news_md\n", + "\n", + "import spacy\n", + "\n", + "nlp = spacy.load(\"pl_core_news_md\")\n", + "doc = nlp(\"Nazywam się Wiktoria\") # \"My name is Wiktoria\" in Polish\n", + "\n", + "for ent in doc.ents:\n", + " print(\n", + " f\"Text: {ent.text}, Start: {ent.start_char}, End: {ent.end_char}, Label: {ent.label_}\"\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The name *Wiktoria* was classified as `persName`, which does not correspond to the default class names `PERSON`/`PER` implemented in Microsoft Presidio (look for `CHECK_LABEL_GROUPS` in the [SpacyRecognizer implementation](https://github.com/microsoft/presidio/blob/main/presidio-analyzer/presidio_analyzer/predefined_recognizers/spacy_recognizer.py)). \n", + "\n", + "You can find out more about custom labels in spaCy models (including your own, trained ones) in [this thread](https://github.com/microsoft/presidio/issues/851).\n", + "\n", + "Because of this label mismatch, our sentence will not be anonymized:" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Nazywam się Wiktoria\n" + ] + } + ], + "source": [ + "nlp_config = {\n", + " \"nlp_engine_name\": \"spacy\",\n", + " \"models\": [\n", + " {\"lang_code\": \"en\", \"model_name\": \"en_core_web_md\"},\n", + " {\"lang_code\": \"es\", \"model_name\": \"es_core_news_md\"},\n", + " {\"lang_code\": \"pl\", \"model_name\": \"pl_core_news_md\"},\n", + " ],\n", + "}\n", + "\n", + "anonymizer = PresidioReversibleAnonymizer(\n", + " analyzed_fields=[\"PERSON\", \"LOCATION\", \"DATE_TIME\"],\n", + " languages_config=nlp_config,\n", + ")\n", + "\n", + "print(\n", + " anonymizer.anonymize(\"Nazywam się Wiktoria\", language=\"pl\")\n", + ") # \"My name is Wiktoria\" in Polish" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To address this, create your own `SpacyRecognizer` with your own 
class mapping and add it to the anonymizer:" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "from presidio_analyzer.predefined_recognizers import SpacyRecognizer\n", + "\n", + "polish_check_label_groups = [\n", + " ({\"LOCATION\"}, {\"placeName\", \"geogName\"}),\n", + " ({\"PERSON\"}, {\"persName\"}),\n", + " ({\"DATE_TIME\"}, {\"date\", \"time\"}),\n", + "]\n", + "\n", + "spacy_recognizer = SpacyRecognizer(\n", + " supported_language=\"pl\",\n", + " check_label_groups=polish_check_label_groups,\n", + ")\n", + "\n", + "anonymizer.add_recognizer(spacy_recognizer)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now everything works smoothly:" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Nazywam się Morgan Walters\n" + ] + } + ], + "source": [ + "print(\n", + " anonymizer.anonymize(\"Nazywam się Wiktoria\", language=\"pl\")\n", + ") # \"My name is Wiktoria\" in Polish" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's try a more complex example:" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Nazywam się Ernest Liu. New Taylorburgh to moje miasto rodzinne. Urodziłam się 1987-01-19\n" + ] + } + ], + "source": [ + "print(\n", + " anonymizer.anonymize(\n", + " \"Nazywam się Wiktoria. Płock to moje miasto rodzinne. Urodziłam się dnia 6 kwietnia 2001 roku\",\n", + " language=\"pl\",\n", + " )\n", + ") # \"My name is Wiktoria. Płock is my home town. I was born on 6 April 2001\" in Polish" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As you can see, thanks to class mapping, the anonymiser can cope with different types of entities. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Custom language-specific operators\n", + "\n", + "In the example above, the sentence has been anonymised correctly, but the fake data does not fit the Polish language at all. Custom operators can therefore be added, which will resolve the issue:" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "from faker import Faker\n", + "from presidio_anonymizer.entities import OperatorConfig\n", + "\n", + "fake = Faker(locale=\"pl_PL\") # Setting faker to provide Polish data\n", + "\n", + "new_operators = {\n", + " \"PERSON\": OperatorConfig(\"custom\", {\"lambda\": lambda _: fake.first_name_female()}),\n", + " \"LOCATION\": OperatorConfig(\"custom\", {\"lambda\": lambda _: fake.city()}),\n", + "}\n", + "\n", + "anonymizer.add_operators(new_operators)" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Nazywam się Marianna. Szczecin to moje miasto rodzinne. Urodziłam się 1976-11-16\n" + ] + } + ], + "source": [ + "print(\n", + " anonymizer.anonymize(\n", + " \"Nazywam się Wiktoria. Płock to moje miasto rodzinne. Urodziłam się dnia 6 kwietnia 2001 roku\",\n", + " language=\"pl\",\n", + " )\n", + ") # \"My name is Wiktoria. Płock is my home town. I was born on 6 April 2001\" in Polish" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Limitations\n", + "\n", + "Remember - results are as good as your recognizers and as your NER models!\n", + "\n", + "Look at the example below - we downloaded the small model for Spanish (12MB) and it no longer performs as well as the medium version (40MB):" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Model: es_core_news_sm. 
Result: Me llamo Sofía\n", + "Model: es_core_news_md. Result: Me llamo Lawrence Davis\n" + ] + } + ], + "source": [ + "# ! python -m spacy download es_core_news_sm\n", + "\n", + "for model in [\"es_core_news_sm\", \"es_core_news_md\"]:\n", + " nlp_config = {\n", + " \"nlp_engine_name\": \"spacy\",\n", + " \"models\": [\n", + " {\"lang_code\": \"es\", \"model_name\": model},\n", + " ],\n", + " }\n", + "\n", + " anonymizer = PresidioReversibleAnonymizer(\n", + " analyzed_fields=[\"PERSON\"],\n", + " languages_config=nlp_config,\n", + " )\n", + "\n", + " print(\n", + " f\"Model: {model}. Result: {anonymizer.anonymize('Me llamo Sofía', language='es')}\"\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In many cases, even the larger models from spaCy will not be sufficient - there are already other, more complex and better methods of detecting named entities, based on transformers. You can read more about this [here](https://microsoft.github.io/presidio/analyzer/nlp_engines/transformers/)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Future works\n", + "\n", + "- **automatic language detection** - instead of passing the language as a parameter in `anonymizer.anonymize`, we could detect the language/s beforehand and then use the corresponding NER model." 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/extras/guides/privacy/presidio_reversible_anonymization.ipynb b/docs/extras/guides/privacy/presidio_reversible_anonymization.ipynb deleted file mode 100644 index 480b26327804..000000000000 --- a/docs/extras/guides/privacy/presidio_reversible_anonymization.ipynb +++ /dev/null @@ -1,461 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Reversible data anonymization with Microsoft Presidio\n", - "\n", - "[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/guides/privacy/presidio_reversible_anonymization.ipynb)\n", - "\n", - "\n", - "## Use case\n", - "\n", - "We have already written about the importance of anonymizing sensitive data in the previous section. **Reversible Anonymization** is an equally essential technology while sharing information with language models, as it balances data protection with data usability. This technique involves masking sensitive personally identifiable information (PII), yet it can be reversed and original data can be restored when authorized users need it. Its main advantage lies in the fact that while it conceals individual identities to prevent misuse, it also allows the concealed data to be accurately unmasked should it be necessary for legal or compliance purposes. \n", - "\n", - "## Overview\n", - "\n", - "We implemented the `PresidioReversibleAnonymizer`, which consists of two parts:\n", - "\n", - "1. 
anonymization - it works the same way as `PresidioAnonymizer`, plus the object itself stores a mapping of made-up values to original ones, for example:\n", - "```\n", - " {\n", - " \"PERSON\": {\n", - " \"\": \"\",\n", - " \"John Doe\": \"Slim Shady\"\n", - " },\n", - " \"PHONE_NUMBER\": {\n", - " \"111-111-1111\": \"555-555-5555\"\n", - " }\n", - " ...\n", - " }\n", - "```\n", - "\n", - "2. deanonymization - using the mapping described above, it matches fake data with original data and then substitutes it.\n", - "\n", - "Between anonymization and deanonymization user can perform different operations, for example, passing the output to LLM.\n", - "\n", - "## Quickstart\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "# Install necessary packages\n", - "# ! pip install langchain langchain-experimental openai presidio-analyzer presidio-anonymizer spacy Faker\n", - "# ! python -m spacy download en_core_web_lg" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "`PresidioReversibleAnonymizer` is not significantly different from its predecessor (`PresidioAnonymizer`) in terms of anonymization:" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'My name is Maria Lynch, call me at 7344131647 or email me at jamesmichael@example.com. 
By the way, my card number is: 4838637940262'" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer\n", - "\n", - "anonymizer = PresidioReversibleAnonymizer(\n", - " analyzed_fields=[\"PERSON\", \"PHONE_NUMBER\", \"EMAIL_ADDRESS\", \"CREDIT_CARD\"],\n", - " # Faker seed is used here to make sure the same fake data is generated for the test purposes\n", - " # In production, it is recommended to remove the faker_seed parameter (it will default to None)\n", - " faker_seed=42,\n", - ")\n", - "\n", - "anonymizer.anonymize(\n", - " \"My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com. \"\n", - " \"By the way, my card number is: 4916 0387 9536 0861\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This is what the full string we want to deanonymize looks like:" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Maria Lynch recently lost his wallet. \n", - "Inside is some cash and his credit card with the number 4838637940262. \n", - "If you would find it, please call at 7344131647 or write an email here: jamesmichael@example.com.\n", - "Maria Lynch would be very grateful!\n" - ] - } - ], - "source": [ - "# We know this data, as we set the faker_seed parameter\n", - "fake_name = \"Maria Lynch\"\n", - "fake_phone = \"7344131647\"\n", - "fake_email = \"jamesmichael@example.com\"\n", - "fake_credit_card = \"4838637940262\"\n", - "\n", - "anonymized_text = f\"\"\"{fake_name} recently lost his wallet. \n", - "Inside is some cash and his credit card with the number {fake_credit_card}. 
\n", - "If you would find it, please call at {fake_phone} or write an email here: {fake_email}.\n", - "{fake_name} would be very grateful!\"\"\"\n", - "\n", - "print(anonymized_text)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "And now, using the `deanonymize` method, we can reverse the process:" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Slim Shady recently lost his wallet. \n", - "Inside is some cash and his credit card with the number 4916 0387 9536 0861. \n", - "If you would find it, please call at 313-666-7440 or write an email here: real.slim.shady@gmail.com.\n", - "Slim Shady would be very grateful!\n" - ] - } - ], - "source": [ - "print(anonymizer.deanonymize(anonymized_text))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Using with LangChain Expression Language\n", - "\n", - "With LCEL we can easily chain together anonymization and deanonymization with the rest of our application. This is an example of using the anonymization mechanism with a query to LLM (without deanonymization for now):" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [], - "source": [ - "text = f\"\"\"Slim Shady recently lost his wallet. \n", - "Inside is some cash and his credit card with the number 4916 0387 9536 0861. \n", - "If you would find it, please call at 313-666-7440 or write an email here: real.slim.shady@gmail.com.\"\"\"" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Dear Sir/Madam,\n", - "\n", - "We regret to inform you that Mr. Dana Rhodes has reported the loss of his wallet. The wallet contains a sum of cash and his credit card, bearing the number 4397528473885757. 
\n", - "\n", - "If you happen to come across the aforementioned wallet, we kindly request that you contact us immediately at 258-481-7074x714 or via email at laurengoodman@example.com.\n", - "\n", - "Your prompt assistance in this matter would be greatly appreciated.\n", - "\n", - "Yours faithfully,\n", - "\n", - "[Your Name]\n" - ] - } - ], - "source": [ - "from langchain.prompts.prompt import PromptTemplate\n", - "from langchain.chat_models import ChatOpenAI\n", - "\n", - "anonymizer = PresidioReversibleAnonymizer()\n", - "\n", - "template = \"\"\"Rewrite this text into an official, short email:\n", - "\n", - "{anonymized_text}\"\"\"\n", - "prompt = PromptTemplate.from_template(template)\n", - "llm = ChatOpenAI(temperature=0)\n", - "\n", - "chain = {\"anonymized_text\": anonymizer.anonymize} | prompt | llm\n", - "response = chain.invoke(text)\n", - "print(response.content)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, let's add **deanonymization step** to our sequence:" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Dear Sir/Madam,\n", - "\n", - "We regret to inform you that Mr. Slim Shady has recently misplaced his wallet. The wallet contains a sum of cash and his credit card, bearing the number 4916 0387 9536 0861. 
\n", - "\n", - "If by any chance you come across the lost wallet, kindly contact us immediately at 313-666-7440 or send an email to real.slim.shady@gmail.com.\n", - "\n", - "Your prompt assistance in this matter would be greatly appreciated.\n", - "\n", - "Yours faithfully,\n", - "\n", - "[Your Name]\n" - ] - } - ], - "source": [ - "chain = chain | (lambda ai_message: anonymizer.deanonymize(ai_message.content))\n", - "response = chain.invoke(text)\n", - "print(response)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Anonymized data was given to the model itself, and therefore it was protected from being leaked to the outside world. Then, the model's response was processed, and the factual value was replaced with the real one." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Extra knowledge" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "`PresidioReversibleAnonymizer` stores the mapping of the fake values to the original values in the `deanonymizer_mapping` parameter, where key is fake PII and value is the original one: " - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'PERSON': {'Maria Lynch': 'Slim Shady'},\n", - " 'PHONE_NUMBER': {'7344131647': '313-666-7440'},\n", - " 'EMAIL_ADDRESS': {'jamesmichael@example.com': 'real.slim.shady@gmail.com'},\n", - " 'CREDIT_CARD': {'4838637940262': '4916 0387 9536 0861'}}" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer\n", - "\n", - "anonymizer = PresidioReversibleAnonymizer(\n", - " analyzed_fields=[\"PERSON\", \"PHONE_NUMBER\", \"EMAIL_ADDRESS\", \"CREDIT_CARD\"],\n", - " # Faker seed is used here to make sure the same fake data is generated for the test purposes\n", - " # In production, it is recommended to remove 
the faker_seed parameter (it will default to None)\n", - " faker_seed=42,\n", - ")\n", - "\n", - "anonymizer.anonymize(\n", - " \"My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com. \"\n", - " \"By the way, my card number is: 4916 0387 9536 0861\"\n", - ")\n", - "\n", - "anonymizer.deanonymizer_mapping" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Anonymizing more texts will result in new mapping entries:" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Do you have his VISA card number? Yep, it's 3537672423884966. I'm William Bowman by the way.\n" - ] - }, - { - "data": { - "text/plain": [ - "{'PERSON': {'Maria Lynch': 'Slim Shady', 'William Bowman': 'John Doe'},\n", - " 'PHONE_NUMBER': {'7344131647': '313-666-7440'},\n", - " 'EMAIL_ADDRESS': {'jamesmichael@example.com': 'real.slim.shady@gmail.com'},\n", - " 'CREDIT_CARD': {'4838637940262': '4916 0387 9536 0861',\n", - " '3537672423884966': '4001 9192 5753 7193'}}" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "print(\n", - " anonymizer.anonymize(\n", - " \"Do you have his VISA card number? Yep, it's 4001 9192 5753 7193. 
I'm John Doe by the way.\"\n", - " )\n", - ")\n", - "\n", - "anonymizer.deanonymizer_mapping" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can save the mapping itself to a file for future use: " - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [], - "source": [ - "# We can save the deanonymizer mapping as a JSON or YAML file\n", - "\n", - "anonymizer.save_deanonymizer_mapping(\"deanonymizer_mapping.json\")\n", - "# anonymizer.save_deanonymizer_mapping(\"deanonymizer_mapping.yaml\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "And then, load it in another `PresidioReversibleAnonymizer` instance:" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{}" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "anonymizer = PresidioReversibleAnonymizer()\n", - "\n", - "anonymizer.deanonymizer_mapping" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'PERSON': {'Maria Lynch': 'Slim Shady', 'William Bowman': 'John Doe'},\n", - " 'PHONE_NUMBER': {'7344131647': '313-666-7440'},\n", - " 'EMAIL_ADDRESS': {'jamesmichael@example.com': 'real.slim.shady@gmail.com'},\n", - " 'CREDIT_CARD': {'4838637940262': '4916 0387 9536 0861',\n", - " '3537672423884966': '4001 9192 5753 7193'}}" - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "anonymizer.load_deanonymizer_mapping(\"deanonymizer_mapping.json\")\n", - "\n", - "anonymizer.deanonymizer_mapping" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Future works\n", - "\n", - "- **instance anonymization** - at this point, each occurrence of PII is treated as a separate entity and separately anonymized. 
Therefore, two occurrences of the name John Doe in the text will be changed to two different names. It is therefore worth introducing support for full instance detection, so that repeated occurrences are treated as a single object.\n", - "- **better matching and substitution of fake values for real ones** - currently the strategy is based on matching full strings and then substituting them. Due to the indeterminism of language models, it may happen that the value in the answer is slightly changed (e.g. *John Doe* -> *John* or *Main St, New York* -> *New York*) and such a substitution is then no longer possible. Therefore, it is worth adjusting the matching for your needs." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.4" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/libs/experimental/langchain_experimental/data_anonymizer/base.py b/libs/experimental/langchain_experimental/data_anonymizer/base.py index 875032342a7a..292d2a2a0f69 100644 --- a/libs/experimental/langchain_experimental/data_anonymizer/base.py +++ b/libs/experimental/langchain_experimental/data_anonymizer/base.py @@ -1,4 +1,5 @@ from abc import ABC, abstractmethod +from typing import Optional class AnonymizerBase(ABC): @@ -8,12 +9,12 @@ class AnonymizerBase(ABC): wrapping the behavior for all methods in a base class. 
""" - def anonymize(self, text: str) -> str: + def anonymize(self, text: str, language: Optional[str] = None) -> str: """Anonymize text""" - return self._anonymize(text) + return self._anonymize(text, language) @abstractmethod - def _anonymize(self, text: str) -> str: + def _anonymize(self, text: str, language: Optional[str]) -> str: """Abstract method to anonymize text""" diff --git a/libs/experimental/langchain_experimental/data_anonymizer/faker_presidio_mapping.py b/libs/experimental/langchain_experimental/data_anonymizer/faker_presidio_mapping.py index c2a339088e99..9015679f2005 100644 --- a/libs/experimental/langchain_experimental/data_anonymizer/faker_presidio_mapping.py +++ b/libs/experimental/langchain_experimental/data_anonymizer/faker_presidio_mapping.py @@ -27,8 +27,8 @@ def get_pseudoanonymizer_mapping(seed: Optional[int] = None) -> Dict[str, Callab fake.random_choices(string.ascii_lowercase + string.digits, length=26) ), "IP_ADDRESS": lambda _: fake.ipv4_public(), - "LOCATION": lambda _: fake.address(), - "DATE_TIME": lambda _: fake.iso8601(), + "LOCATION": lambda _: fake.city(), + "DATE_TIME": lambda _: fake.date(), "NRP": lambda _: str(fake.random_number(digits=8, fix_len=True)), "MEDICAL_LICENSE": lambda _: fake.bothify(text="??######").upper(), "URL": lambda _: fake.url(), diff --git a/libs/experimental/langchain_experimental/data_anonymizer/presidio.py b/libs/experimental/langchain_experimental/data_anonymizer/presidio.py index d4886eb32c1c..b2be1dc5a1c0 100644 --- a/libs/experimental/langchain_experimental/data_anonymizer/presidio.py +++ b/libs/experimental/langchain_experimental/data_anonymizer/presidio.py @@ -24,6 +24,8 @@ try: from presidio_analyzer import AnalyzerEngine + from presidio_analyzer.nlp_engine import NlpEngineProvider + except ImportError as e: raise ImportError( "Could not import presidio_analyzer, please install with " @@ -44,12 +46,29 @@ from presidio_analyzer import EntityRecognizer, RecognizerResult from 
presidio_anonymizer.entities import EngineResult +# Configuring Anonymizer for multiple languages +# Detailed description and examples can be found here: +# langchain/docs/extras/guides/privacy/presidio_data_anonymization/multi_language.ipynb +DEFAULT_LANGUAGES_CONFIG = { + # You can also use the Stanza or transformers libraries. + # See https://microsoft.github.io/presidio/analyzer/customizing_nlp_models/ + "nlp_engine_name": "spacy", + "models": [ + {"lang_code": "en", "model_name": "en_core_web_lg"}, + # {"lang_code": "de", "model_name": "de_core_news_md"}, + # {"lang_code": "es", "model_name": "es_core_news_md"}, + # ... + # List of available models: https://spacy.io/usage/models + ], +} + class PresidioAnonymizerBase(AnonymizerBase): def __init__( self, analyzed_fields: Optional[List[str]] = None, operators: Optional[Dict[str, OperatorConfig]] = None, + languages_config: Dict = DEFAULT_LANGUAGES_CONFIG, faker_seed: Optional[int] = None, ): """ @@ -60,6 +79,11 @@ def __init__( Operators allow for custom anonymization of detected PII. Learn more: https://microsoft.github.io/presidio/tutorial/10_simple_anonymization/ + languages_config: Configuration for the NLP engine. + First language in the list will be used as the main language + in self.anonymize(...) when no language is specified. + Learn more: + https://microsoft.github.io/presidio/analyzer/customizing_nlp_models/ faker_seed: Seed used to initialize faker. Defaults to None, in which case faker will be seeded randomly and provide random values. 
@@ -81,7 +105,15 @@ def __init__( ).items() } ) - self._analyzer = AnalyzerEngine() + + provider = NlpEngineProvider(nlp_configuration=languages_config) + nlp_engine = provider.create_engine() + + self.supported_languages = list(nlp_engine.nlp.keys()) + + self._analyzer = AnalyzerEngine( + supported_languages=self.supported_languages, nlp_engine=nlp_engine + ) self._anonymizer = AnonymizerEngine() def add_recognizer(self, recognizer: EntityRecognizer) -> None: @@ -103,18 +135,31 @@ def add_operators(self, operators: Dict[str, OperatorConfig]) -> None: class PresidioAnonymizer(PresidioAnonymizerBase): - def _anonymize(self, text: str) -> str: + def _anonymize(self, text: str, language: Optional[str] = None) -> str: """Anonymize text. Each PII entity is replaced with a fake value. Each time fake values will be different, as they are generated randomly. Args: text: text to anonymize + language: language to use for analysis of PII + If None, the first (main) language in the list + of languages specified in the configuration will be used. """ + if language is None: + language = self.supported_languages[0] + + if language not in self.supported_languages: + raise ValueError( + f"Language '{language}' is not supported. " + f"Supported languages are: {self.supported_languages}. " + "Change your language configuration file to add more languages." 
+ ) + results = self._analyzer.analyze( text, entities=self.analyzed_fields, - language="en", + language=language, ) return self._anonymizer.anonymize( @@ -129,9 +174,10 @@ def __init__( self, analyzed_fields: Optional[List[str]] = None, operators: Optional[Dict[str, OperatorConfig]] = None, + languages_config: Dict = DEFAULT_LANGUAGES_CONFIG, faker_seed: Optional[int] = None, ): - super().__init__(analyzed_fields, operators, faker_seed) + super().__init__(analyzed_fields, operators, languages_config, faker_seed) self._deanonymizer_mapping = DeanonymizerMapping() @property @@ -191,7 +237,7 @@ def _update_deanonymizer_mapping( self._deanonymizer_mapping.update(new_deanonymizer_mapping) - def _anonymize(self, text: str) -> str: + def _anonymize(self, text: str, language: Optional[str] = None) -> str: """Anonymize text. Each PII entity is replaced with a fake value. Each time fake values will be different, as they are generated randomly. @@ -200,11 +246,24 @@ def _anonymize(self, text: str) -> str: Args: text: text to anonymize + language: language to use for analysis of PII + If None, the first (main) language in the list + of languages specified in the configuration will be used. """ + if language is None: + language = self.supported_languages[0] + + if language not in self.supported_languages: + raise ValueError( + f"Language '{language}' is not supported. " + f"Supported languages are: {self.supported_languages}. " + "Change your language configuration file to add more languages." 
+ ) + analyzer_results = self._analyzer.analyze( text, entities=self.analyzed_fields, - language="en", + language=language, ) filtered_analyzer_results = ( From 1d2b6c3c67fdd5dbd3f9161001703fae40668a71 Mon Sep 17 00:00:00 2001 From: Bagatur Date: Thu, 7 Sep 2023 14:45:07 -0700 Subject: [PATCH 2/3] Reorganize presidio anonymization docs --- .../index.ipynb} | 0 .../multi_language.ipynb} | 0 .../reversible.ipynb} | 0 3 files changed, 0 insertions(+), 0 deletions(-) rename docs/extras/guides/privacy/{01_presidio_data_anonymization.ipynb => presidio_data_anonymization/index.ipynb} (100%) rename docs/extras/guides/privacy/{03_presidio_multi_language_anonymization.ipynb => presidio_data_anonymization/multi_language.ipynb} (100%) rename docs/extras/guides/privacy/{02_presidio_reversible_anonymization.ipynb => presidio_data_anonymization/reversible.ipynb} (100%) diff --git a/docs/extras/guides/privacy/01_presidio_data_anonymization.ipynb b/docs/extras/guides/privacy/presidio_data_anonymization/index.ipynb similarity index 100% rename from docs/extras/guides/privacy/01_presidio_data_anonymization.ipynb rename to docs/extras/guides/privacy/presidio_data_anonymization/index.ipynb diff --git a/docs/extras/guides/privacy/03_presidio_multi_language_anonymization.ipynb b/docs/extras/guides/privacy/presidio_data_anonymization/multi_language.ipynb similarity index 100% rename from docs/extras/guides/privacy/03_presidio_multi_language_anonymization.ipynb rename to docs/extras/guides/privacy/presidio_data_anonymization/multi_language.ipynb diff --git a/docs/extras/guides/privacy/02_presidio_reversible_anonymization.ipynb b/docs/extras/guides/privacy/presidio_data_anonymization/reversible.ipynb similarity index 100% rename from docs/extras/guides/privacy/02_presidio_reversible_anonymization.ipynb rename to docs/extras/guides/privacy/presidio_data_anonymization/reversible.ipynb From 41a25486113c51ad1ed9c348843fbc480242fe7f Mon Sep 17 00:00:00 2001 From: Bagatur Date: Thu, 7 Sep 2023 
14:47:09 -0700 Subject: [PATCH 3/3] Fix presidio docs Colab links --- .../presidio_data_anonymization/index.ipynb | 4 +- .../multi_language.ipynb | 4 +- .../reversible.ipynb | 918 +++++++++--------- 3 files changed, 463 insertions(+), 463 deletions(-) diff --git a/docs/extras/guides/privacy/presidio_data_anonymization/index.ipynb b/docs/extras/guides/privacy/presidio_data_anonymization/index.ipynb index c06157c1187d..2502a4509224 100644 --- a/docs/extras/guides/privacy/presidio_data_anonymization/index.ipynb +++ b/docs/extras/guides/privacy/presidio_data_anonymization/index.ipynb @@ -6,7 +6,7 @@ "source": [ "# Data anonymization with Microsoft Presidio\n", "\n", - "[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/guides/privacy/01_presidio_data_anonymization.ipynb)\n", + "[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/guides/privacy/presidio_data_anonymization/index.ipynb)\n", "\n", "## Use case\n", "\n", @@ -459,7 +459,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.4" + "version": "3.9.1" } }, "nbformat": 4, diff --git a/docs/extras/guides/privacy/presidio_data_anonymization/multi_language.ipynb b/docs/extras/guides/privacy/presidio_data_anonymization/multi_language.ipynb index c6c144ebae9a..63ba8931a60b 100644 --- a/docs/extras/guides/privacy/presidio_data_anonymization/multi_language.ipynb +++ b/docs/extras/guides/privacy/presidio_data_anonymization/multi_language.ipynb @@ -6,7 +6,7 @@ "source": [ "# Mutli-language data anonymization with Microsoft Presidio\n", "\n", - "[![Open In 
Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/guides/privacy/03_presidio_multi_language_anonymization.ipynb)\n", + "[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/guides/privacy/presidio_data_anonymization/multi_language.ipynb)\n", "\n", "\n", "## Use case\n", @@ -512,7 +512,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.4" + "version": "3.9.1" } }, "nbformat": 4, diff --git a/docs/extras/guides/privacy/presidio_data_anonymization/reversible.ipynb b/docs/extras/guides/privacy/presidio_data_anonymization/reversible.ipynb index 4c75523969a6..de5655ba1e9d 100644 --- a/docs/extras/guides/privacy/presidio_data_anonymization/reversible.ipynb +++ b/docs/extras/guides/privacy/presidio_data_anonymization/reversible.ipynb @@ -1,461 +1,461 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Reversible data anonymization with Microsoft Presidio\n", - "\n", - "[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/guides/privacy/02_presidio_reversible_anonymization.ipynb)\n", - "\n", - "\n", - "## Use case\n", - "\n", - "We have already written about the importance of anonymizing sensitive data in the previous section. **Reversible Anonymization** is an equally essential technology while sharing information with language models, as it balances data protection with data usability. This technique involves masking sensitive personally identifiable information (PII), yet it can be reversed and original data can be restored when authorized users need it. 
Its main advantage lies in the fact that while it conceals individual identities to prevent misuse, it also allows the concealed data to be accurately unmasked should it be necessary for legal or compliance purposes. \n", - "\n", - "## Overview\n", - "\n", - "We implemented the `PresidioReversibleAnonymizer`, which consists of two parts:\n", - "\n", - "1. anonymization - it works the same way as `PresidioAnonymizer`, plus the object itself stores a mapping of made-up values to original ones, for example:\n", - "```\n", - " {\n", - " \"PERSON\": {\n", - " \"\": \"\",\n", - " \"John Doe\": \"Slim Shady\"\n", - " },\n", - " \"PHONE_NUMBER\": {\n", - " \"111-111-1111\": \"555-555-5555\"\n", - " }\n", - " ...\n", - " }\n", - "```\n", - "\n", - "2. deanonymization - using the mapping described above, it matches fake data with original data and then substitutes it.\n", - "\n", - "Between anonymization and deanonymization user can perform different operations, for example, passing the output to LLM.\n", - "\n", - "## Quickstart\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "# Install necessary packages\n", - "# ! pip install langchain langchain-experimental openai presidio-analyzer presidio-anonymizer spacy Faker\n", - "# ! python -m spacy download en_core_web_lg" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "`PresidioReversibleAnonymizer` is not significantly different from its predecessor (`PresidioAnonymizer`) in terms of anonymization:" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'My name is Maria Lynch, call me at 7344131647 or email me at jamesmichael@example.com. 
By the way, my card number is: 4838637940262'" - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer\n", - "\n", - "anonymizer = PresidioReversibleAnonymizer(\n", - " analyzed_fields=[\"PERSON\", \"PHONE_NUMBER\", \"EMAIL_ADDRESS\", \"CREDIT_CARD\"],\n", - " # Faker seed is used here to make sure the same fake data is generated for the test purposes\n", - " # In production, it is recommended to remove the faker_seed parameter (it will default to None)\n", - " faker_seed=42,\n", - ")\n", - "\n", - "anonymizer.anonymize(\n", - " \"My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com. \"\n", - " \"By the way, my card number is: 4916 0387 9536 0861\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This is what the full string we want to deanonymize looks like:" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Maria Lynch recently lost his wallet. \n", - "Inside is some cash and his credit card with the number 4838637940262. \n", - "If you would find it, please call at 7344131647 or write an email here: jamesmichael@example.com.\n", - "Maria Lynch would be very grateful!\n" - ] - } - ], - "source": [ - "# We know this data, as we set the faker_seed parameter\n", - "fake_name = \"Maria Lynch\"\n", - "fake_phone = \"7344131647\"\n", - "fake_email = \"jamesmichael@example.com\"\n", - "fake_credit_card = \"4838637940262\"\n", - "\n", - "anonymized_text = f\"\"\"{fake_name} recently lost his wallet. \n", - "Inside is some cash and his credit card with the number {fake_credit_card}. 
\n", - "If you would find it, please call at {fake_phone} or write an email here: {fake_email}.\n", - "{fake_name} would be very grateful!\"\"\"\n", - "\n", - "print(anonymized_text)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "And now, using the `deanonymize` method, we can reverse the process:" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Slim Shady recently lost his wallet. \n", - "Inside is some cash and his credit card with the number 4916 0387 9536 0861. \n", - "If you would find it, please call at 313-666-7440 or write an email here: real.slim.shady@gmail.com.\n", - "Slim Shady would be very grateful!\n" - ] - } - ], - "source": [ - "print(anonymizer.deanonymize(anonymized_text))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Using with LangChain Expression Language\n", - "\n", - "With LCEL we can easily chain together anonymization and deanonymization with the rest of our application. This is an example of using the anonymization mechanism with a query to LLM (without deanonymization for now):" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [], - "source": [ - "text = f\"\"\"Slim Shady recently lost his wallet. \n", - "Inside is some cash and his credit card with the number 4916 0387 9536 0861. \n", - "If you would find it, please call at 313-666-7440 or write an email here: real.slim.shady@gmail.com.\"\"\"" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Dear Sir/Madam,\n", - "\n", - "We regret to inform you that Mr. Dana Rhodes has reported the loss of his wallet. The wallet contains a sum of cash and his credit card, bearing the number 4397528473885757. 
\n", - "\n", - "If you happen to come across the aforementioned wallet, we kindly request that you contact us immediately at 258-481-7074x714 or via email at laurengoodman@example.com.\n", - "\n", - "Your prompt assistance in this matter would be greatly appreciated.\n", - "\n", - "Yours faithfully,\n", - "\n", - "[Your Name]\n" - ] - } - ], - "source": [ - "from langchain.prompts.prompt import PromptTemplate\n", - "from langchain.chat_models import ChatOpenAI\n", - "\n", - "anonymizer = PresidioReversibleAnonymizer()\n", - "\n", - "template = \"\"\"Rewrite this text into an official, short email:\n", - "\n", - "{anonymized_text}\"\"\"\n", - "prompt = PromptTemplate.from_template(template)\n", - "llm = ChatOpenAI(temperature=0)\n", - "\n", - "chain = {\"anonymized_text\": anonymizer.anonymize} | prompt | llm\n", - "response = chain.invoke(text)\n", - "print(response.content)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, let's add **deanonymization step** to our sequence:" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Dear Sir/Madam,\n", - "\n", - "We regret to inform you that Mr. Slim Shady has recently misplaced his wallet. The wallet contains a sum of cash and his credit card, bearing the number 4916 0387 9536 0861. 
\n", - "\n", - "If by any chance you come across the lost wallet, kindly contact us immediately at 313-666-7440 or send an email to real.slim.shady@gmail.com.\n", - "\n", - "Your prompt assistance in this matter would be greatly appreciated.\n", - "\n", - "Yours faithfully,\n", - "\n", - "[Your Name]\n" - ] - } - ], - "source": [ - "chain = chain | (lambda ai_message: anonymizer.deanonymize(ai_message.content))\n", - "response = chain.invoke(text)\n", - "print(response)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Anonymized data was given to the model itself, and therefore it was protected from being leaked to the outside world. Then, the model's response was processed, and the factual value was replaced with the real one." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Extra knowledge" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "`PresidioReversibleAnonymizer` stores the mapping of the fake values to the original values in the `deanonymizer_mapping` parameter, where key is fake PII and value is the original one: " - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'PERSON': {'Maria Lynch': 'Slim Shady'},\n", - " 'PHONE_NUMBER': {'7344131647': '313-666-7440'},\n", - " 'EMAIL_ADDRESS': {'jamesmichael@example.com': 'real.slim.shady@gmail.com'},\n", - " 'CREDIT_CARD': {'4838637940262': '4916 0387 9536 0861'}}" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer\n", - "\n", - "anonymizer = PresidioReversibleAnonymizer(\n", - " analyzed_fields=[\"PERSON\", \"PHONE_NUMBER\", \"EMAIL_ADDRESS\", \"CREDIT_CARD\"],\n", - " # Faker seed is used here to make sure the same fake data is generated for the test purposes\n", - " # In production, it is recommended to remove 
the faker_seed parameter (it will default to None)\n", - " faker_seed=42,\n", - ")\n", - "\n", - "anonymizer.anonymize(\n", - " \"My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com. \"\n", - " \"By the way, my card number is: 4916 0387 9536 0861\"\n", - ")\n", - "\n", - "anonymizer.deanonymizer_mapping" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Anonymizing more texts will result in new mapping entries:" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Do you have his VISA card number? Yep, it's 3537672423884966. I'm William Bowman by the way.\n" - ] - }, - { - "data": { - "text/plain": [ - "{'PERSON': {'Maria Lynch': 'Slim Shady', 'William Bowman': 'John Doe'},\n", - " 'PHONE_NUMBER': {'7344131647': '313-666-7440'},\n", - " 'EMAIL_ADDRESS': {'jamesmichael@example.com': 'real.slim.shady@gmail.com'},\n", - " 'CREDIT_CARD': {'4838637940262': '4916 0387 9536 0861',\n", - " '3537672423884966': '4001 9192 5753 7193'}}" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "print(\n", - " anonymizer.anonymize(\n", - " \"Do you have his VISA card number? Yep, it's 4001 9192 5753 7193. 
I'm John Doe by the way.\"\n", - " )\n", - ")\n", - "\n", - "anonymizer.deanonymizer_mapping" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can save the mapping itself to a file for future use: " - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [], - "source": [ - "# We can save the deanonymizer mapping as a JSON or YAML file\n", - "\n", - "anonymizer.save_deanonymizer_mapping(\"deanonymizer_mapping.json\")\n", - "# anonymizer.save_deanonymizer_mapping(\"deanonymizer_mapping.yaml\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "And then, load it in another `PresidioReversibleAnonymizer` instance:" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{}" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "anonymizer = PresidioReversibleAnonymizer()\n", - "\n", - "anonymizer.deanonymizer_mapping" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "{'PERSON': {'Maria Lynch': 'Slim Shady', 'William Bowman': 'John Doe'},\n", - " 'PHONE_NUMBER': {'7344131647': '313-666-7440'},\n", - " 'EMAIL_ADDRESS': {'jamesmichael@example.com': 'real.slim.shady@gmail.com'},\n", - " 'CREDIT_CARD': {'4838637940262': '4916 0387 9536 0861',\n", - " '3537672423884966': '4001 9192 5753 7193'}}" - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "anonymizer.load_deanonymizer_mapping(\"deanonymizer_mapping.json\")\n", - "\n", - "anonymizer.deanonymizer_mapping" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Future works\n", - "\n", - "- **instance anonymization** - at this point, each occurrence of PII is treated as a separate entity and separately anonymized. 
Therefore, two occurrences of the name John Doe in the text will be changed to two different names. It is therefore worth introducing support for full instance detection, so that repeated occurrences are treated as a single object.\n", - "- **better matching and substitution of fake values for real ones** - currently the strategy is based on matching full strings and then substituting them. Due to the indeterminism of language models, it may happen that the value in the answer is slightly changed (e.g. *John Doe* -> *John* or *Main St, New York* -> *New York*) and such a substitution is then no longer possible. Therefore, it is worth adjusting the matching for your needs." - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.4" - } + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Reversible data anonymization with Microsoft Presidio\n", + "\n", + "[![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/guides/privacy/presidio_data_anonymization/reversible.ipynb)\n", + "\n", + "\n", + "## Use case\n", + "\n", + "We have already written about the importance of anonymizing sensitive data in the previous section. **Reversible Anonymization** is an equally essential technology while sharing information with language models, as it balances data protection with data usability. This technique involves masking sensitive personally identifiable information (PII), yet it can be reversed and original data can be restored when authorized users need it. 
Its main advantage lies in the fact that while it conceals individual identities to prevent misuse, it also allows the concealed data to be accurately unmasked should it be necessary for legal or compliance purposes. \n", + "\n", + "## Overview\n", + "\n", + "We implemented the `PresidioReversibleAnonymizer`, which consists of two parts:\n", + "\n", + "1. anonymization - it works the same way as `PresidioAnonymizer`, plus the object itself stores a mapping of made-up values to original ones, for example:\n", + "```\n", + " {\n", + " \"PERSON\": {\n", + " \"\": \"\",\n", + " \"John Doe\": \"Slim Shady\"\n", + " },\n", + " \"PHONE_NUMBER\": {\n", + " \"111-111-1111\": \"555-555-5555\"\n", + " }\n", + " ...\n", + " }\n", + "```\n", + "\n", + "2. deanonymization - using the mapping described above, it matches fake data with original data and then substitutes it.\n", + "\n", + "Between anonymization and deanonymization user can perform different operations, for example, passing the output to LLM.\n", + "\n", + "## Quickstart\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# Install necessary packages\n", + "# ! pip install langchain langchain-experimental openai presidio-analyzer presidio-anonymizer spacy Faker\n", + "# ! python -m spacy download en_core_web_lg" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`PresidioReversibleAnonymizer` is not significantly different from its predecessor (`PresidioAnonymizer`) in terms of anonymization:" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'My name is Maria Lynch, call me at 7344131647 or email me at jamesmichael@example.com. 
By the way, my card number is: 4838637940262'" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer\n", + "\n", + "anonymizer = PresidioReversibleAnonymizer(\n", + " analyzed_fields=[\"PERSON\", \"PHONE_NUMBER\", \"EMAIL_ADDRESS\", \"CREDIT_CARD\"],\n", + " # Faker seed is used here to make sure the same fake data is generated for the test purposes\n", + " # In production, it is recommended to remove the faker_seed parameter (it will default to None)\n", + " faker_seed=42,\n", + ")\n", + "\n", + "anonymizer.anonymize(\n", + " \"My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com. \"\n", + " \"By the way, my card number is: 4916 0387 9536 0861\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This is what the full string we want to deanonymize looks like:" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Maria Lynch recently lost his wallet. \n", + "Inside is some cash and his credit card with the number 4838637940262. \n", + "If you would find it, please call at 7344131647 or write an email here: jamesmichael@example.com.\n", + "Maria Lynch would be very grateful!\n" + ] + } + ], + "source": [ + "# We know this data, as we set the faker_seed parameter\n", + "fake_name = \"Maria Lynch\"\n", + "fake_phone = \"7344131647\"\n", + "fake_email = \"jamesmichael@example.com\"\n", + "fake_credit_card = \"4838637940262\"\n", + "\n", + "anonymized_text = f\"\"\"{fake_name} recently lost his wallet. \n", + "Inside is some cash and his credit card with the number {fake_credit_card}. 
\n", + "If you would find it, please call at {fake_phone} or write an email here: {fake_email}.\n", + "{fake_name} would be very grateful!\"\"\"\n", + "\n", + "print(anonymized_text)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And now, using the `deanonymize` method, we can reverse the process:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Slim Shady recently lost his wallet. \n", + "Inside is some cash and his credit card with the number 4916 0387 9536 0861. \n", + "If you would find it, please call at 313-666-7440 or write an email here: real.slim.shady@gmail.com.\n", + "Slim Shady would be very grateful!\n" + ] + } + ], + "source": [ + "print(anonymizer.deanonymize(anonymized_text))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Using with LangChain Expression Language\n", + "\n", + "With LCEL we can easily chain together anonymization and deanonymization with the rest of our application. This is an example of using the anonymization mechanism with a query to LLM (without deanonymization for now):" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "text = f\"\"\"Slim Shady recently lost his wallet. \n", + "Inside is some cash and his credit card with the number 4916 0387 9536 0861. \n", + "If you would find it, please call at 313-666-7440 or write an email here: real.slim.shady@gmail.com.\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Dear Sir/Madam,\n", + "\n", + "We regret to inform you that Mr. Dana Rhodes has reported the loss of his wallet. The wallet contains a sum of cash and his credit card, bearing the number 4397528473885757. 
\n", + "\n", + "If you happen to come across the aforementioned wallet, we kindly request that you contact us immediately at 258-481-7074x714 or via email at laurengoodman@example.com.\n", + "\n", + "Your prompt assistance in this matter would be greatly appreciated.\n", + "\n", + "Yours faithfully,\n", + "\n", + "[Your Name]\n" + ] + } + ], + "source": [ + "from langchain.prompts.prompt import PromptTemplate\n", + "from langchain.chat_models import ChatOpenAI\n", + "\n", + "anonymizer = PresidioReversibleAnonymizer()\n", + "\n", + "template = \"\"\"Rewrite this text into an official, short email:\n", + "\n", + "{anonymized_text}\"\"\"\n", + "prompt = PromptTemplate.from_template(template)\n", + "llm = ChatOpenAI(temperature=0)\n", + "\n", + "chain = {\"anonymized_text\": anonymizer.anonymize} | prompt | llm\n", + "response = chain.invoke(text)\n", + "print(response.content)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let's add **deanonymization step** to our sequence:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Dear Sir/Madam,\n", + "\n", + "We regret to inform you that Mr. Slim Shady has recently misplaced his wallet. The wallet contains a sum of cash and his credit card, bearing the number 4916 0387 9536 0861. 
\n", + "\n", + "If by any chance you come across the lost wallet, kindly contact us immediately at 313-666-7440 or send an email to real.slim.shady@gmail.com.\n", + "\n", + "Your prompt assistance in this matter would be greatly appreciated.\n", + "\n", + "Yours faithfully,\n", + "\n", + "[Your Name]\n" + ] + } + ], + "source": [ + "chain = chain | (lambda ai_message: anonymizer.deanonymize(ai_message.content))\n", + "response = chain.invoke(text)\n", + "print(response)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Anonymized data was given to the model itself, and therefore it was protected from being leaked to the outside world. Then, the model's response was processed, and the factual value was replaced with the real one." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Extra knowledge" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`PresidioReversibleAnonymizer` stores the mapping of the fake values to the original values in the `deanonymizer_mapping` parameter, where key is fake PII and value is the original one: " + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'PERSON': {'Maria Lynch': 'Slim Shady'},\n", + " 'PHONE_NUMBER': {'7344131647': '313-666-7440'},\n", + " 'EMAIL_ADDRESS': {'jamesmichael@example.com': 'real.slim.shady@gmail.com'},\n", + " 'CREDIT_CARD': {'4838637940262': '4916 0387 9536 0861'}}" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from langchain_experimental.data_anonymizer import PresidioReversibleAnonymizer\n", + "\n", + "anonymizer = PresidioReversibleAnonymizer(\n", + " analyzed_fields=[\"PERSON\", \"PHONE_NUMBER\", \"EMAIL_ADDRESS\", \"CREDIT_CARD\"],\n", + " # Faker seed is used here to make sure the same fake data is generated for the test purposes\n", + " # In production, it is recommended to remove 
the faker_seed parameter (it will default to None)\n", + " faker_seed=42,\n", + ")\n", + "\n", + "anonymizer.anonymize(\n", + " \"My name is Slim Shady, call me at 313-666-7440 or email me at real.slim.shady@gmail.com. \"\n", + " \"By the way, my card number is: 4916 0387 9536 0861\"\n", + ")\n", + "\n", + "anonymizer.deanonymizer_mapping" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Anonymizing more texts will result in new mapping entries:" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Do you have his VISA card number? Yep, it's 3537672423884966. I'm William Bowman by the way.\n" + ] }, - "nbformat": 4, - "nbformat_minor": 4 -} \ No newline at end of file + { + "data": { + "text/plain": [ + "{'PERSON': {'Maria Lynch': 'Slim Shady', 'William Bowman': 'John Doe'},\n", + " 'PHONE_NUMBER': {'7344131647': '313-666-7440'},\n", + " 'EMAIL_ADDRESS': {'jamesmichael@example.com': 'real.slim.shady@gmail.com'},\n", + " 'CREDIT_CARD': {'4838637940262': '4916 0387 9536 0861',\n", + " '3537672423884966': '4001 9192 5753 7193'}}" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "print(\n", + " anonymizer.anonymize(\n", + " \"Do you have his VISA card number? Yep, it's 4001 9192 5753 7193. 
I'm John Doe by the way.\"\n", + " )\n", + ")\n", + "\n", + "anonymizer.deanonymizer_mapping" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can save the mapping itself to a file for future use: " + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "# We can save the deanonymizer mapping as a JSON or YAML file\n", + "\n", + "anonymizer.save_deanonymizer_mapping(\"deanonymizer_mapping.json\")\n", + "# anonymizer.save_deanonymizer_mapping(\"deanonymizer_mapping.yaml\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And then, load it in another `PresidioReversibleAnonymizer` instance:" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{}" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "anonymizer = PresidioReversibleAnonymizer()\n", + "\n", + "anonymizer.deanonymizer_mapping" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'PERSON': {'Maria Lynch': 'Slim Shady', 'William Bowman': 'John Doe'},\n", + " 'PHONE_NUMBER': {'7344131647': '313-666-7440'},\n", + " 'EMAIL_ADDRESS': {'jamesmichael@example.com': 'real.slim.shady@gmail.com'},\n", + " 'CREDIT_CARD': {'4838637940262': '4916 0387 9536 0861',\n", + " '3537672423884966': '4001 9192 5753 7193'}}" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "anonymizer.load_deanonymizer_mapping(\"deanonymizer_mapping.json\")\n", + "\n", + "anonymizer.deanonymizer_mapping" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Future works\n", + "\n", + "- **instance anonymization** - at this point, each occurrence of PII is treated as a separate entity and separately anonymized. 
Therefore, two occurrences of the name John Doe in the text will be changed to two different names. It is therefore worth introducing support for full instance detection, so that repeated occurrences are treated as a single object.\n", + "- **better matching and substitution of fake values for real ones** - currently the strategy is based on matching full strings and then substituting them. Due to the indeterminism of language models, it may happen that the value in the answer is slightly changed (e.g. *John Doe* -> *John* or *Main St, New York* -> *New York*) and such a substitution is then no longer possible. Therefore, it is worth adjusting the matching for your needs." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.1" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}