# Privacy Scrubbing with Loclean

This notebook demonstrates how to scrub personally identifiable information (PII) locally using Loclean.

> **⚡ Zero Setup:** Loclean auto-starts the Ollama daemon and auto-pulls models on first use.

> **📚 Full Documentation:** [Privacy Scrubbing Guide](https://nxank4.github.io/loclean/guides/privacy/)

In [None]:
!pip install loclean

In [1]:
import polars as pl

import loclean

## Basic Usage

### Scrub Text

In [None]:
# Text with PII
text = "Contact John Doe at john@example.com or call 555-1234"

# Scrub all PII (default: mask mode)
cleaned = loclean.scrub(text)
print(f"Original: {text}")
print(f"Cleaned:  {cleaned}")

### Scrub DataFrame

In [3]:
df = pl.DataFrame(
    {
        "text": [
            "Contact John Doe at john@example.com",
            "Call Mary Smith at 555-1234",  # US phone format
            "Email: admin@company.com",
        ]
    }
)

print("Original DataFrame:")
print(df)

# Scrub PII in DataFrame column
# Note: Returns DataFrame with scrubbed column (same structure as input)
# Phone detection supports multiple international formats:
# - US/Canada: (555) 123-4567, 555-123-4567, 555-1234
# - International: +44 20 7946 0958, +33 1 23 45 67 89
# - Vietnamese: 0909123456, +84901234567
result = loclean.scrub(df, target_col="text")

print("\nCleaned DataFrame:")
print(result)

Original DataFrame:
shape: (3, 1)
┌─────────────────────────────────┐
│ text                            │
│ ---                             │
│ str                             │
╞═════════════════════════════════╡
│ Contact John Doe at john@examp… │
│ Call Mary Smith at 555-1234     │
│ Email: admin@company.com        │
└─────────────────────────────────┘

Cleaned DataFrame:
shape: (3, 1)
┌─────────────────────────────┐
│ text                        │
│ ---                         │
│ str                         │
╞═════════════════════════════╡
│ Contact [PERSON] at [EMAIL] │
│ Call [PERSON] at [PHONE]    │
│ Email: [EMAIL]              │
└─────────────────────────────┘


## Scrubbing Modes

### Mask Mode (Default)

Replaces PII with type-specific placeholders like `[PERSON]`, `[EMAIL]`, `[PHONE]`:

In [4]:
text = "John Doe: john@example.com"
cleaned = loclean.scrub(text, mode="mask")
print(f"Original: {text}")
print(f"Masked:   {cleaned}")

Original: John Doe: john@example.com
Masked:   [PERSON]: [EMAIL]
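
To make the placeholder idea concrete, here is a minimal standard-library sketch of masking with typed placeholders. It is not Loclean's implementation: the regex patterns and the `mask` helper are hypothetical and only cover simple email and `555-1234`-style phone shapes. Person names are left untouched because, as noted in this notebook, detecting them requires an LLM.

```python
import re

# Hypothetical patterns for illustration only; Loclean's real detectors
# are not shown in this notebook and are likely more thorough.
PII_PATTERN = re.compile(
    r"(?P<EMAIL>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"   # simple email shape
    r"|(?P<PHONE>\b(?:\d{3}-)?\d{3}-\d{4}\b)"    # 555-1234 or 555-123-4567
)


def mask(text: str) -> str:
    """Replace each match with a placeholder named after the group that matched."""
    # m.lastgroup is the name of the named group that matched (EMAIL or PHONE).
    return PII_PATTERN.sub(lambda m: f"[{m.lastgroup}]", text)


print(mask("Contact John Doe at john@example.com or call 555-1234"))
```

Combining the patterns into one alternation means the text is scanned once, and the group name doubles as the placeholder label.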


### Fake Mode

Replace PII with fake data instead of masking:

In [5]:
# Replace PII with fake data (mode="fake")
text = "Contact John Doe at john@example.com or call 555-1234"
cleaned = loclean.scrub(
    text,
    mode="fake",
    locale="en_US",  # Use English locale for fake data
)
print(f"Original: {text}")
print(f"Fake:     {cleaned}")

Original: Contact John Doe at john@example.com or call 555-1234
Fake:     Contact Michael Rodriguez at ianderson@example.net or call 357-4163


## Selective Scrubbing

Scrub only specific PII types by specifying `strategies`. Available strategies:
- `"person"`: Person names (requires LLM)
- `"phone"`: Phone numbers
- `"email"`: Email addresses
- `"credit_card"`: Credit card numbers
- `"address"`: Physical addresses (requires LLM)
- `"ip_address"`: IP addresses

In [6]:
# Only scrub emails and phone numbers
# Note: "person" is not in strategies, so "John Doe" remains unchanged
text = "John Doe: john@example.com, 555-1234"
cleaned = loclean.scrub(text, strategies=["email", "phone"])
print(f"Original: {text}")
print(f"Cleaned:  {cleaned}")

Original: John Doe: john@example.com, 555-1234
Cleaned:  John Doe: [EMAIL], [PHONE]
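
The selection behavior above can be pictured as a lookup from strategy name to detector. The sketch below is an assumption-laden illustration using only the standard library, not Loclean's code: the `DETECTORS` table, the `scrub_selected` helper, the regex patterns, and the `[IP_ADDRESS]` placeholder text are all hypothetical.

```python
import re

# Hypothetical regex detectors keyed by strategy name (illustration only).
# The "[IP_ADDRESS]" placeholder is an assumption, not confirmed Loclean output.
DETECTORS = {
    "email": (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    "phone": (re.compile(r"\b(?:\d{3}-)?\d{3}-\d{4}\b"), "[PHONE]"),
    "ip_address": (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP_ADDRESS]"),
}


def scrub_selected(text: str, strategies: list[str]) -> str:
    """Apply only the detectors named in `strategies`; other PII is left as-is."""
    for name in strategies:
        if name not in DETECTORS:
            raise ValueError(f"Unknown strategy: {name!r}")
        pattern, placeholder = DETECTORS[name]
        text = pattern.sub(placeholder, text)
    return text


sample = "admin@company.com logged in from 10.0.0.1, call 555-1234"
print(scrub_selected(sample, ["email"]))
print(scrub_selected(sample, ["email", "ip_address"]))
```

Anything not named in `strategies` passes through unchanged, which mirrors how `"John Doe"` survives in the cell above when `"person"` is omitted.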


## Best Practices

### Tips for Better Scrubbing

1. **Choose the right mode**: Use `mask` for anonymization and `fake` for realistic test data.
2. **Selective scrubbing**: Pass the `strategies` parameter to scrub only specific PII types.
3. **Locale support**: Set `locale` to control fake data generation (e.g., `"en_US"`, `"vi_VN"`).
4. **Multi-country support**: Phone numbers and addresses are detected across multiple countries.
5. **LLM-based detection**: Person names and addresses are detected by LLM inference, which Loclean runs automatically.

## Next Steps

- **Quick Start:** See [01-quick-start.ipynb](./01-quick-start.ipynb) for an overview of all features
- **Data Cleaning:** See [02-data-cleaning.ipynb](./02-data-cleaning.ipynb) for data normalization
- **Structured Extraction:** See [04-structured-extraction.ipynb](./04-structured-extraction.ipynb) for complex schemas
- **Full Documentation:** [https://nxank4.github.io/loclean](https://nxank4.github.io/loclean)