Presidio - Data protection and anonymization API

Context aware, pluggable and customizable PII anonymization service for text and images.



Description

Presidio (from the Latin praesidium, 'protection, garrison') helps ensure sensitive text is properly managed and governed. It provides fast analytics and anonymization for sensitive text such as credit card numbers, names, locations, social security numbers, bitcoin wallets, US phone numbers and financial data. Presidio analyzes text using predefined or custom recognizers to identify entities, patterns, formats, and checksums with relevant context, and leverages Docker and Kubernetes for workloads at scale.

Why use Presidio?

Presidio can be integrated into any data pipeline for intelligent PII scrubbing. It is open-source, transparent and scalable. Additionally, PII anonymization use-cases often require a different set of PII entities to be detected, some of which are domain or business specific. Presidio allows you to customize or add new PII recognizers via API or code to best fit your anonymization needs.

⚠️ Presidio can help identify sensitive/PII data in unstructured and structured text. However, because Presidio uses trained ML models, there is no guarantee that it will find all sensitive information. Consequently, additional systems and protections should be employed.

Demo

Try Presidio with your own data

Features

Unstructured text anonymization

Presidio automatically detects Personally Identifiable Information (PII) in unstructured text, anonymizes it based on one or more anonymization mechanisms, and returns a string with no personally identifiable data. For example:

[Image: example of anonymized text]

For each PII entity, Presidio returns a confidence score:

[Image: detected PII entities with confidence scores]

Text anonymization in images (beta)

Presidio uses OCR to detect text in images. It further allows the redaction of the text from the original image.

[Image: text redacted from an image]

Learn more

More information can be found in the Presidio documentation.

Input and output

Presidio accepts multiple sources and targets for data anonymization. Specifically:

  1. Storage solutions

    • Azure Blob Storage
    • S3
    • Google Cloud Storage
  2. Databases

    • MySQL
    • PostgreSQL
    • SQL Server
    • Oracle
  3. Streaming platforms

    • Kafka
    • Azure Event Hubs
  4. REST requests

It can then export the results to file storage, databases or streaming platforms.

The Technology Stack

Presidio leverages Docker and Kubernetes to run its workloads at scale.

Quickstart

  1. Install Presidio
  2. Decide on a name for your Presidio project. In the following examples the project name is <my-project>.
  3. Start using the Presidio analyze and anonymize services.
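
The samples below assume you know the address of the Presidio API service, referred to as <api-service-address>. A minimal sketch for locating it on a Kubernetes installation (the presidio namespace is an assumption; adjust it to match your deployment):

# List the services created by the installation and note the external address and port of the API service
kubectl get svc -n presidio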

Samples

Note: The examples below use HTTPie.

Sample 1: Simple text analysis

echo -n '{"text":"John Smith lives in New York. We met yesterday morning in Seattle. I called him before on (212) 555-1234 to verify the appointment. He also told me that his drivers license is AC333991", "analyzeTemplate":{"allFields":true}  }' | http <api-service-address>/api/v1/projects/<my-project>/analyze

Sample 2: Create reusable templates

  1. Create an analyzer template:

    echo -n '{"allFields":true}' | http <api-service-address>/api/v1/templates/<my-project>/analyze/<my-template-name>
  2. Analyze text:

    echo -n '{"text":"my credit card number is 2970-84746760-9907 345954225667833 4961-2765-5327-5913", "AnalyzeTemplateId":"<my-template-name>"  }' | http <api-service-address>/api/v1/projects/<my-project>/analyze

Sample 3: Detect specific entities

  1. Create an analyzer project with a specific set of entities:

    echo -n '{"fields":[{"name":"PHONE_NUMBER"}, {"name":"LOCATION"}, {"name":"DATE_TIME"}]}' | http <api-service-address>/api/v1/templates/<my-project>/analyze/<my-template-name>
  2. Analyze text:

    echo -n '{"text":"We met yesterday morning in Seattle and his phone number is (212) 555 1234", "AnalyzeTemplateId":"<my-template-name>"  }' | http <api-service-address>/api/v1/projects/<my-project>/analyze

Sample 4: Custom anonymization

  1. Create an anonymizer template (this template replaces PHONE_NUMBER values and redacts CREDIT_CARD values):

    echo -n '{"fieldTypeTransformations":[{"fields":[{"name":"PHONE_NUMBER"}],"transformation":{"replaceValue":{"newValue":"\u003cphone-number\u003e"}}},{"fields":[{"name":"CREDIT_CARD"}],"transformation":{"redactValue":{}}}]}' | http <api-service-address>/api/v1/templates/<my-project>/anonymize/<my-anonymize-template-name>
  2. Anonymize text:

    echo -n '{"text":"my phone number is 057-555-2323 and my credit card is 4961-2765-5327-5913", "AnalyzeTemplateId":"<my-analyze-template-name>", "AnonymizeTemplateId":"<my-anonymize-template-name>"  }' | http <api-service-address>/api/v1/projects/<my-project>/anonymize

Sample 5: Add custom PII entity recognizer

This sample shows how to add a new regex-based recognizer via the API. This simple recognizer identifies the word "rocket" in text and tags it as a "ROCKET" entity.

  1. Add a custom recognizer:

    echo -n {"value": {"entity": "ROCKET","language": "en", "patterns": [{"name": "rocket-regex","regex": "\\W*(rocket)\\W*","score": 1}]}} | http <api-service-address>/api/v1/analyzer/recognizers/rocket
  2. Analyze text:

    echo -n '{"text":"They sent a rocket to the moon!", "analyzeTemplate":{"allFields":true}  }' | http <api-service-address>/api/v1/projects/<my-project>/analyze

Sample 6: Image anonymization

  1. Create an anonymizer image template (this template redacts detected values with black boxes):

    echo -n '{"fieldTypeGraphics":[{"graphic":{"fillColorValue":{"blue":0,"red":0,"green":0}}}]}' | http <api-service-address>/api/v1/templates/<my-project>/anonymize-image/<my-anonymize-image-template-name>
  2. Anonymize image:

    http -f POST <api-service-address>/api/v1/projects/<my-project>/anonymize-image detectionType='OCR' analyzeTemplateId='<my-analyze-template-name>' anonymizeImageTemplateId='<my-anonymize-image-template-name>' imageType='image/png' file@~/test-ocr.png > test-output.png

Single click deployment using the default values

The script installs Presidio on your Kubernetes cluster. Prerequisites:

  1. A Kubernetes cluster with RBAC enabled

  2. kubectl installed

    • verify you can communicate with the cluster by running:

      kubectl version
  3. A local helm client.

  4. A recent clone of the Presidio repository on your local machine.
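
A minimal sketch for verifying these prerequisites from a terminal; the repository URL points at the public Presidio repo, and the helm command uses helm 2.x syntax:

# Verify connectivity to the cluster
kubectl version
# Verify the local helm client is installed (helm 2.x)
helm version --client
# Clone a recent copy of the repository
git clone https://github.com/microsoft/presidio.git
cd presidio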

Installation Steps

  1. Navigate to <root>/deployment from the command line.

  2. If you have helm installed but haven't run helm init, execute deploy-helm.sh from the command line. It installs tiller (the helm server side) on your cluster and grants it sufficient permissions.

  3. Grant the Kubernetes cluster access to the container registry.

  4. If you already have helm and tiller configured, or installed them in the previous step, execute deploy-presidio.sh from the command line as follows:

deploy-presidio.sh

The script will install Presidio on your cluster using the default values.

Note: You can edit the script to use your own container registry and images.
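
Putting the steps together, a typical run from the root of a freshly cloned repository looks roughly like this (a sketch assuming helm and tiller have not been initialized yet; skip deploy-helm.sh if they have):

cd deployment
# Install tiller on the cluster and grant it permissions (only needed if helm init has not been run)
./deploy-helm.sh
# Install Presidio with the default values
./deploy-presidio.sh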

Current input/output component status

Module Feature Status
API HTTP input
Scanner MySQL
Scanner MSSQL
Scanner PostgreSQL
Scanner Oracle
Scanner Azure Blob Storage
Scanner S3
Scanner Google Cloud Storage
Streams Kafka
Streams Azure Event Hub
Datasink (output) MySQL
Datasink (output) MSSQL
Datasink (output) Oracle
Datasink (output) PostgreSQL
Datasink (output) Kafka
Datasink (output) Azure Event Hub
Datasink (output) Azure Blob Storage
Datasink (output) S3
Datasink (output) Google Cloud Storage
  • ✅ - Working
  • 🔶 - Partially supported (alpha)
  • ❌ - Not supported yet

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
