# Google Cloud DLP 

This notebooks explored Google's Data Prevention Loss (a.k.a [DLP](https://cloud.google.com/security/products/dlp?hl=es)) API to identify `Personal Identifyable Information` (PII)

> ℹ️ Know more check the [DLP API](https://cloud.google.com/sensitive-data-protection/docs/libraries#client-libraries-install-python) & the [How-To guides](https://cloud.google.com/sensitive-data-protection/docs/how-to)

> ⚠️ Requires the `GOOGLE_APPLICATION_CREDENTIAL` env var


<details>
<summary>👉 Possible PII types: (click to expand)</summary>

```
# Personally Identifiable Information (PII) Categorization

## Country-Specific Identifiers

### Australia
- AUSTRALIA_DRIVERS_LICENSE_NUMBER
- AUSTRALIA_MEDICARE_NUMBER
- AUSTRALIA_PASSPORT
- AUSTRALIA_TAX_FILE_NUMBER

### Argentina
- ARGENTINA_DNI_NUMBER

### Armenia
- ARMENIA_PASSPORT

### Azerbaijan
- AZERBAIJAN_PASSPORT

### Belarus
- BELARUS_PASSPORT

### Belgium
- BELGIUM_NATIONAL_ID_CARD_NUMBER

### Brazil
- BRAZIL_CPF_NUMBER

### Canada
- CANADA_BANK_ACCOUNT
- CANADA_BC_PHN
- CANADA_DRIVERS_LICENSE_NUMBER
- CANADA_OHIP
- CANADA_PASSPORT
- CANADA_QUEBEC_HIN
- CANADA_SOCIAL_INSURANCE_NUMBER

### Chile
- CHILE_CDI_NUMBER

### China
- CHINA_PASSPORT
- CHINA_RESIDENT_ID_NUMBER

### Colombia
- COLOMBIA_CDC_NUMBER

### Croatia
- CROATIA_PERSONAL_ID_NUMBER

### Denmark
- DENMARK_CPR_NUMBER

### Finland
- FINLAND_BUSINESS_ID
- FINLAND_NATIONAL_ID_NUMBER

### France
- FRANCE_CNI
- FRANCE_DRIVERS_LICENSE_NUMBER
- FRANCE_NIR
- FRANCE_PASSPORT
- FRANCE_TAX_IDENTIFICATION_NUMBER

### Germany
- GERMANY_DRIVERS_LICENSE_NUMBER
- GERMANY_IDENTITY_CARD_NUMBER
- GERMANY_PASSPORT
- GERMANY_SCHUFA_ID
- GERMANY_TAXPAYER_IDENTIFICATION_NUMBER

### Hong Kong
- HONG_KONG_ID_NUMBER

### India
- INDIA_AADHAAR_INDIVIDUAL
- INDIA_GST_INDIVIDUAL
- INDIA_PAN_INDIVIDUAL
- INDIA_PASSPORT

### Indonesia
- INDONESIA_NIK_NUMBER
- INDONESIA_PASSPORT

### Ireland
- IRELAND_DRIVING_LICENSE_NUMBER
- IRELAND_EIRCODE
- IRELAND_PASSPORT
- IRELAND_PPSN

### Israel
- ISRAEL_IDENTITY_CARD_NUMBER

### Italy
- ITALY_FISCAL_CODE
- ITALY_PASSPORT

### Japan
- JAPAN_BANK_ACCOUNT
- JAPAN_CORPORATE_NUMBER
- JAPAN_DRIVERS_LICENSE_NUMBER
- JAPAN_INDIVIDUAL_NUMBER
- JAPAN_PASSPORT

### Kazakhstan
- KAZAKHSTAN_PASSPORT

### Korea
- KOREA_ARN
- KOREA_DRIVERS_LICENSE_NUMBER
- KOREA_PASSPORT
- KOREA_RRN

### Mexico
- MEXICO_CURP_NUMBER
- MEXICO_PASSPORT

### Netherlands
- NETHERLANDS_BSN_NUMBER
- NETHERLANDS_PASSPORT

### New Zealand
- NEW_ZEALAND_IRD_NUMBER
- NEW_ZEALAND_NHI_NUMBER

### Norway
- NORWAY_NI_NUMBER

### Paraguay
- PARAGUAY_CIC_NUMBER
- PARAGUAY_TAX_NUMBER

### Peru
- PERU_DNI_NUMBER

### Poland
- POLAND_NATIONAL_ID_NUMBER
- POLAND_PASSPORT
- POLAND_PESEL_NUMBER

### Portugal
- PORTUGAL_CDC_NUMBER
- PORTUGAL_NIB_NUMBER
- PORTUGAL_SOCIAL_SECURITY_NUMBER

### Russia
- RUSSIA_PASSPORT

### Scotland
- SCOTLAND_COMMUNITY_HEALTH_INDEX_NUMBER

### Singapore
- SINGAPORE_NATIONAL_REGISTRATION_ID_NUMBER
- SINGAPORE_PASSPORT

### South Africa
- SOUTH_AFRICA_ID_NUMBER

### Spain
- SPAIN_CIF_NUMBER
- SPAIN_DNI_NUMBER
- SPAIN_DRIVERS_LICENSE_NUMBER
- SPAIN_NIE_NUMBER
- SPAIN_NIF_NUMBER
- SPAIN_PASSPORT
- SPAIN_SOCIAL_SECURITY_NUMBER

### Sweden
- SWEDEN_NATIONAL_ID_NUMBER
- SWEDEN_PASSPORT

### Switzerland
- SWITZERLAND_SOCIAL_SECURITY_NUMBER

### Taiwan
- TAIWAN_ID_NUMBER
- TAIWAN_PASSPORT

### Thailand
- THAILAND_NATIONAL_ID_NUMBER

### Turkey
- TURKEY_ID_NUMBER

### Ukraine
- UKRAINE_PASSPORT

### United Kingdom
- UK_DRIVERS_LICENSE_NUMBER
- UK_ELECTORAL_ROLL_NUMBER
- UK_NATIONAL_HEALTH_SERVICE_NUMBER
- UK_NATIONAL_INSURANCE_NUMBER
- UK_PASSPORT
- UK_TAXPAYER_REFERENCE

### United States
- US_ADOPTION_TAXPAYER_IDENTIFICATION_NUMBER
- US_BANK_ROUTING_MICR
- US_DEA_NUMBER
- US_DRIVERS_LICENSE_NUMBER
- US_EMPLOYER_IDENTIFICATION_NUMBER
- US_HEALTHCARE_NPI
- US_INDIVIDUAL_TAXPAYER_IDENTIFICATION_NUMBER
- US_MEDICARE_BENEFICIARY_ID_NUMBER
- US_PASSPORT
- US_PREPARER_TAXPAYER_IDENTIFICATION_NUMBER
- US_SOCIAL_SECURITY_NUMBER
- US_STATE
- US_TOLLFREE_PHONE_NUMBER
- US_VEHICLE_IDENTIFICATION_NUMBER

### Uruguay
- URUGUAY_CDI_NUMBER

### Uzbekistan
- UZBEKISTAN_PASSPORT

## Non-Country Specific Identifiers

### Personal Identifiers
- ADVERTISING_ID
- AGE
- AUTH_TOKEN
- DEMOGRAPHIC_DATA
- DOD_ID_NUMBER
- EMAIL_ADDRESS
- EMPLOYMENT_STATUS
- ETHNIC_GROUP
- FEMALE_NAME
- FIRST_NAME
- GENDER
- GENERIC_ID
- GOVERNMENT_ID
- IMMIGRATION_STATUS
- LAST_NAME
- LOCATION
- LOCATION_COORDINATES
- MALE_NAME
- MARITAL_STATUS
- ORGANIZATION_NAME
- PERSON_NAME
- POLITICAL_TERM
- RELIGIOUS_TERM
- SEXUAL_ORIENTATION
- TECHNICAL_ID
- TRADE_UNION

### Financial Identifiers
- AMERICAN_BANKERS_CUSIP_ID
- CREDIT_CARD_DATA
- CREDIT_CARD_EXPIRATION_DATE
- CREDIT_CARD_NUMBER
- CREDIT_CARD_TRACK_NUMBER
- CVV_NUMBER
- FINANCIAL_ACCOUNT_NUMBER
- FINANCIAL_ID
- IBAN_CODE
- SWIFT_CODE
- VAT_NUMBER

### Medical Identifiers
- BLOOD_TYPE
- FDA_CODE
- ICD10_CODE
- ICD9_CODE
- MEDICAL_DATA
- MEDICAL_RECORD_NUMBER
- MEDICAL_TERM

### Technology and Security Identifiers
- AWS_CREDENTIALS
- AZURE_AUTH_TOKEN
- BASIC_AUTH_HEADER
- DOMAIN_NAME
- ENCRYPTION_KEY
- GCP_API_KEY
- GCP_CREDENTIALS
- HTTP_COOKIE
- HTTP_USER_AGENT
- ICCID_NUMBER
- IMEI_HARDWARE_ID
- IMSI_ID
- IP_ADDRESS
- JSON_WEB_TOKEN
- MAC_ADDRESS
- MAC_ADDRESS_LOCAL
- OAUTH_CLIENT_SECRET
- PASSWORD
- SECURITY_DATA
- SSL_CERTIFICATE
- STORAGE_SIGNED_POLICY_DOCUMENT
- STORAGE_SIGNED_URL
- TINK_KEYSET
- URL
- WEAK_PASSWORD_HASH
- XSRF_TOKEN

### Other Identifiers
- DATE
- DATE_OF_BIRTH
- GEOGRAPHIC_DATA
- PASSPORT
- PHONE_NUMBER
- STREET_ADDRESS
- TIME
- VEHICLE_IDENTIFICATION_NUMBER

### Document Types
- DOCUMENT_TYPE/FINANCE/REGULATORY
- DOCUMENT_TYPE/FINANCE/SEC_FILING
- DOCUMENT_TYPE/HR/RESUME
- DOCUMENT_TYPE/LEGAL/BLANK_FORM
- DOCUMENT_TYPE/LEGAL/BRIEF
- DOCUMENT_TYPE/LEGAL/COURT_ORDER
- DOCUMENT_TYPE/LEGAL/LAW
- DOCUMENT_TYPE/LEGAL/PLEADING
- DOCUMENT_TYPE/R&D/DATABASE_BACKUP
- DOCUMENT_TYPE/R&D/PATENT
- DOCUMENT_TYPE/R&D/SOURCE_CODE
- DOCUMENT_TYPE/R&D/SYSTEM_LOG
```
</details>

## ⚙️ Setup

In [1]:
# install uv
!curl -LsSf https://astral.sh/uv/install.sh | sh

# install python deps
!uv pip install -q --system \
    python-dotenv \
    "google-cloud-dlp==3.25.1" \
    "google-cloud-storage==2.9.0" \
    "google-cloud-pubsub==2.28.0" \
    "google-cloud-datastore==2.20.1" \
    "google-cloud-bigquery==3.27.0"

downloading uv 0.6.10 x86_64-unknown-linux-gnu
no checksums to verify
installing to /root/.local/bin
  uv
  uvx
everything's installed!


## 📚 Data

In [4]:
from pathlib import Path

DATA_DIR = Path("/datasets/client-data-us/")

md_docs = list(DATA_DIR.rglob("**/*.md"))
print(f"Total markdown documents: {len(md_docs)}")

text = md_docs[0].open("r").read()
print(text[:200])

Total markdown documents: 60
#### **MASTER LICENSE AGREEMENT**

**Between Customer and Supply Chain Consultants, Inc. d/b/a Arkieva Linden Green Center 5460 Fairmont Drive Wilmington, DE 19808 Telephone: 302-738-9215 Fed ID/TIN: 


## 🕵️ DLP

In [5]:
import os
from dotenv import load_dotenv


load_dotenv()

# Set the Project ID and credentials (to be configured in the Google Console)
PROJECT_ID = "dlp-test-fml"

In [6]:
# Import the client library
from google.cloud import dlp_v2


def list_info_types(dlp_client):
    """
    Retrieve and print all available info types in DLP API
    """
    # List all supported info types
    request = {"parent": "projects/-"}
    response = dlp_client.list_info_types(request=request)
    
    info_type_names = sorted([it.name for it in response.info_types])
    print("Available Info Types:")
    for name in sorted(info_type_names):
        print(f"- {name}")


def inspect(clp_client, project_id, content, info_types: list[dict]):

    # Define the inspection config
    inspect_config = {
        "info_types": info_types,
        "include_quote": True,
        "min_likelihood": dlp_v2.Likelihood.POSSIBLE, # Or LIKELIHOOD_UNSPECIFIED
        "limits": {"max_findings_per_request": 0}  # 0 means no limit
    }
    
    # Construct the item to inspect.
    item = {"value": content}

    # Whether to include the matching string in the results. Optional.
    include_quote = True

    # Configure the inspect request
    parent = f"projects/{project_id}"

    # Call the API.
    response = dlp_client.inspect_content(
        request={"parent": parent, "inspect_config": inspect_config, "item": item}
    )

    # Print out the results
    results = []
    if response.result.findings:
        for finding in response.result.findings:
            results.append({
                "info_type": finding.info_type.name,
                "likelihood": dlp_v2.Likelihood(finding.likelihood).name,
            })
            try:
                results[-1].update({"quote": finding.quote})
            except AttributeError:
                pass

    return results


In [10]:
# Instantiate a client.
dlp_client = dlp_v2.DlpServiceClient()

# Run the function to see available types
# list_info_types(dlp_client)

## 🫥 Anonymisation

In [9]:
DEFAULT_INFO_TYPES = [
    # Banking
    {"name": "CREDIT_CARD_NUMBER"},
    {"name": "IBAN_CODE"},
    {"name": "SWIFT_CODE"},
    # Personal
    {"name": "PERSON_NAME"},
    {"name": "DATE_OF_BIRTH"},
    # Contact details
    {"name": "PHONE_NUMBER"},
    {"name": "EMAIL_ADDRESS"},
    # Locations
    {"name": "LOCATION"},
]

# Run the inspection
detected_pii_entities = inspect(dlp_client, PROJECT_ID, text, DEFAULT_INFO_TYPES)
for pii in detected_pii_entities:
    print(f"Type: {pii['info_type']} | Text: {pii['quote']} | Confidence: ({pii['likelihood']})")

Type: PERSON_NAME | Text: Arkieva Linden | Confidence: (LIKELY)
Type: LOCATION | Text: Arkieva Linden Green Center | Confidence: (LIKELY)
Type: LOCATION | Text: 5460 Fairmont Drive Wilmington, DE 19808 | Confidence: (LIKELY)
Type: PHONE_NUMBER | Text: 302-738-9215 | Confidence: (VERY_LIKELY)
Type: PHONE_NUMBER | Text: 51-035 0007 | Confidence: (POSSIBLE)
Type: PERSON_NAME | Text: BUCKMAN | Confidence: (POSSIBLE)
Type: PERSON_NAME | Text: McLean | Confidence: (POSSIBLE)
Type: LOCATION | Text: 1256 North McLean Boulevard, Memphis Tennessee 38108 | Confidence: (LIKELY)
Type: PHONE_NUMBER | Text: 38108-1241 | Confidence: (LIKELY)
Type: PHONE_NUMBER | Text: (901) 278-0330 | Confidence: (VERY_LIKELY)
Type: PERSON_NAME | Text: Buckman | Confidence: (POSSIBLE)
Type: PERSON_NAME | Text: Sujit K. Singh | Confidence: (LIKELY)
Type: PERSON_NAME | Text: COO, Arkieva | Confidence: (POSSIBLE)
Type: PERSON_NAME | Text: Buckman | Confidence: (POSSIBLE)
Type: PERSON_NAME | Text: Buckman | Confidence: (L

In [24]:
import copy
import re

def mask_text(text, detected_pii_entities):
    masked_text = copy.copy(text)

    # Sort so longer spans come first in case there's overlap
    sorted_pii = sorted(
        detected_pii_entities, key=lambda p: len(p["quote"]), reverse=True
    )
    # Mask the entities in the original text
    for pii in sorted_pii:
        pii_text = pii["quote"]
        masked_text = re.sub(
            pii_text, f"[{pii['info_type']} REDACTED]" , masked_text, count=0, flags=0
        )
        
    return masked_text
    
print(mask_text(text, detected_pii_entities))

#### **MASTER LICENSE AGREEMENT**

**Between Customer and Supply Chain Consultants, Inc. d/b/a [LOCATION REDACTED] [LOCATION REDACTED] Telephone: [PHONE_NUMBER REDACTED] Fed ID/TIN: [PHONE_NUMBER REDACTED]**

Customer Name: [PERSON_NAME REDACTED] LABORATORIES INTERNATIONAL, INC.

Address: [LOCATION REDACTED]-1241, U.S.A. Telephone: (901) 278-0330

ATTN:

| 1.0  | DEFINITIONS<br>3             |
|------|------------------------------|
| 2.0  | LICENSE3                     |
| 3.0  | USE3                         |
| 4.0  | PAYMENT<br>4                 |
| 5.0  | SOFTWARE MAINTENANCE4        |
| 6.0  | WARRANTIES AND REMEDIES<br>4 |
| 7.0  | INDEMNITY5                   |
| 8.0  | LIMITATION OF LIABILITY5     |
| 9.0  | OWNERSHIP6                   |
| 10.0 | TERMINATION OF DISTRIBUTOR6  |
| 11.0 | CONFIDENTIALITY<br>6         |
| 12.0 | TERMINATION7                 |
| 13.0 | QUALITY CONTROL7             |
| 14.0 | ASSIGNMENT7                  |
| 15.0 | U.S. EXPORT RESTRICTIONS7    |
| 1