# Getting Started with CloakPivot

This notebook introduces CloakPivot and walks through basic document masking and unmasking operations.

## What You'll Learn

- How to perform basic document masking
- How CloakMaps make masking reversible
- How to work with different masking strategies
- How to configure a basic policy
- How to unmask documents to restore the original content

## Prerequisites

Make sure you have CloakPivot installed:

```bash
pip install cloakpivot
```

In [None]:
# Import required libraries
import json
import tempfile
from pathlib import Path

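# Import CloakPivot components
# Note: mask_document and unmask_document are called in later cells; this
# assumes they are exported from the top-level package like the other names.
from cloakpivot import (
    CloakMap,
    MaskingPolicy,
    Strategy,
    StrategyKind,
    mask_document,
    unmask_document,
)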

## Creating Sample Data

Let's start by creating a sample document with PII that we can use for demonstration purposes.

In [None]:
# Create sample document content in DocPivot JSON format
sample_document = {
    "name": "Sample Employee Record",
    "texts": [
        {
            "text": "Employee Information for John Doe",
            "type": "title",
            "node_id": "title_001"
        },
        {
            "text": "Contact John Doe at john.doe@company.com or call (555) 123-4567 for any questions.",
            "type": "text",
            "node_id": "text_001"
        },
        {
            "text": "Employee ID: EMP-2023-001, SSN: 123-45-6789, Start Date: 2023-01-15",
            "type": "text",
            "node_id": "text_002"
        },
        {
            "text": "Emergency Contact: Jane Doe (spouse) - jane.doe@email.com, (555) 987-6543",
            "type": "text",
            "node_id": "text_003"
        }
    ],
    "metadata": {
        "created_at": "2023-12-07T10:00:00Z",
        "document_type": "employee_record"
    }
}

# Save to temporary file
temp_dir = Path(tempfile.mkdtemp())
input_file = temp_dir / "employee_record.docling.json"

with open(input_file, 'w') as f:
    json.dump(sample_document, f, indent=2)

print(f"Sample document created at: {input_file}")
print("\nDocument content:")
for text_item in sample_document["texts"]:
    print(f"- {text_item['text']}")

## Basic Masking with Default Policy

Let's start with the simplest masking operation using the default policy.

In [None]:
# Perform basic masking with default policy
result = mask_document(
    input_path=str(input_file),
    output_format="docling",
    cloakmap_path=str(temp_dir / "basic_mask.cloakmap.json")
)

print("✅ Masking completed!")
print(f"Masked document: {result.masked_path}")
print(f"CloakMap: {result.cloakmap_path}")
print(f"Entities processed: {result.stats.total_entities_found}")

# Load and display the masked content
with open(result.masked_path) as f:
    masked_doc = json.load(f)

print("\n📄 Masked document content:")
for text_item in masked_doc["texts"]:
    print(f"- {text_item['text']}")

## Examining the CloakMap

The CloakMap contains the information needed to reverse the masking process. Let's examine its structure.
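
Before loading the real CloakMap, it can help to see the idea in miniature. The block below is a conceptual sketch only, not CloakPivot's implementation: `ToyAnchor`, `toy_mask`, and `toy_unmask` are hypothetical names, and the way the original value is stored here is an assumption made for illustration. It shows how recording one anchor per replaced span is enough to reverse the masking.

```python
from dataclasses import dataclass


@dataclass
class ToyAnchor:
    """One masked span: where it sits in the masked text and what it replaced."""
    node_id: str
    start: int            # start offset in the masked text
    end: int              # exclusive end offset in the masked text
    entity_type: str
    masked_value: str
    original_value: str   # assumption: kept in the clear for this toy example


def toy_mask(node_id, text, spans):
    """Replace (start, end, entity_type, replacement) spans and record anchors.

    Span offsets refer to the original text and must not overlap.
    """
    anchors, pieces = [], []
    cursor = 0    # position in the original text
    out_len = 0   # length of the masked text built so far
    for start, end, entity_type, replacement in sorted(spans):
        pieces.append(text[cursor:start])
        out_len += start - cursor
        anchors.append(ToyAnchor(node_id, out_len, out_len + len(replacement),
                                 entity_type, replacement, text[start:end]))
        pieces.append(replacement)
        out_len += len(replacement)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), anchors


def toy_unmask(masked, anchors):
    """Restore originals right to left so earlier offsets stay valid."""
    for anchor in sorted(anchors, key=lambda a: a.start, reverse=True):
        if masked[anchor.start:anchor.end] != anchor.masked_value:
            raise ValueError(f"Anchor does not match masked text at {anchor.start}")
        masked = masked[:anchor.start] + anchor.original_value + masked[anchor.end:]
    return masked


original = "SSN: 123-45-6789, email: john.doe@company.com"
ssn, email = "123-45-6789", "john.doe@company.com"
spans = [
    (original.index(ssn), original.index(ssn) + len(ssn), "US_SSN", "XXX-XX-XXXX"),
    (original.index(email), original.index(email) + len(email), "EMAIL_ADDRESS", "[EMAIL]"),
]

masked, anchors = toy_mask("text_002", original, spans)
print(masked)
print(toy_unmask(masked, anchors) == original)
```

The anchors printed by the next cell play the same role as `ToyAnchor` here: each one ties a masked span back to a location in the document.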

In [None]:
# Load and examine the CloakMap
with open(result.cloakmap_path) as f:
    cloakmap_data = json.load(f)

cloakmap = CloakMap.from_dict(cloakmap_data)

print("🗺️ CloakMap Information:")
print(f"Document ID: {cloakmap.doc_id}")
print(f"Total anchors: {len(cloakmap.anchors)}")
print(f"Created at: {cloakmap.created_at}")

print("\n📊 Entity breakdown:")
entity_counts = cloakmap.entity_count_by_type
for entity_type, count in entity_counts.items():
    print(f"- {entity_type}: {count}")

print("\n🔍 Sample anchors:")
for i, anchor in enumerate(cloakmap.anchors[:3]):
    print(f"- Anchor {i+1}: {anchor.entity_type} at {anchor.node_id}:{anchor.start}-{anchor.end}")
    print(f"  Masked as: {anchor.masked_value}")
    print(f"  Strategy: {anchor.strategy_used}")

## Custom Policy Configuration

Let's create a custom policy with different strategies for different types of PII.
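
As a mental model for the policy built in the next cell, here is a minimal sketch of how a per-entity lookup with a default fallback and confidence thresholds could behave. It is an illustration under stated assumptions, not CloakPivot's logic: `choose_strategy` and `default_threshold` are hypothetical, and skipping detections that score below the threshold is an assumed behavior.

```python
def choose_strategy(entity_type, score, per_entity, thresholds,
                    default="redact", default_threshold=0.5):
    """Return a strategy name for a detection, or None if the score is too low.

    Assumption: detections below their threshold are left unmasked.
    """
    if score < thresholds.get(entity_type, default_threshold):
        return None
    # Entity types without a specific entry fall back to the default strategy.
    return per_entity.get(entity_type, default)


per_entity = {"PERSON": "template", "EMAIL_ADDRESS": "partial", "US_SSN": "template"}
thresholds = {"PERSON": 0.7, "EMAIL_ADDRESS": 0.6, "US_SSN": 0.9}

detections = [("PERSON", 0.85), ("US_SSN", 0.80), ("CREDIT_CARD", 0.90)]
for entity_type, score in detections:
    print(entity_type, score, "->", choose_strategy(entity_type, score, per_entity, thresholds))
```

In this toy run the low-confidence `US_SSN` detection is skipped, while `CREDIT_CARD`, which has no specific entry, falls back to the default.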

In [None]:
# Create a custom masking policy
custom_policy = MaskingPolicy(
    locale="en",
    seed="demo-seed-2023",  # For deterministic results

    # Default strategy for unspecified entity types
    default_strategy=Strategy(
        kind=StrategyKind.REDACT,
        parameters={"redact_char": "*", "preserve_length": True}
    ),

    # Specific strategies for different entity types
    per_entity={
        "PERSON": Strategy(
            kind=StrategyKind.TEMPLATE,
            parameters={"template": "[PERSON]"}
        ),
        "EMAIL_ADDRESS": Strategy(
            kind=StrategyKind.PARTIAL,
            parameters={
                "visible_chars": 3,
                "position": "start",
                "format_aware": True
            }
        ),
        "PHONE_NUMBER": Strategy(
            kind=StrategyKind.PARTIAL,
            parameters={
                "visible_chars": 4,
                "position": "end"
            }
        ),
        "US_SSN": Strategy(
            kind=StrategyKind.TEMPLATE,
            parameters={"template": "XXX-XX-XXXX"}
        )
    },

    # Confidence thresholds
    thresholds={
        "PERSON": 0.7,
        "EMAIL_ADDRESS": 0.6,
        "PHONE_NUMBER": 0.7,
        "US_SSN": 0.9
    }
)

print("✅ Custom policy created with the following strategies:")
print(f"- Default: {custom_policy.default_strategy.kind.value}")
for entity_type, strategy in custom_policy.per_entity.items():
    print(f"- {entity_type}: {strategy.kind.value}")

## Masking with Custom Policy

Now let's apply our custom policy to the same document and compare the results.

In [None]:
# Apply custom policy
custom_result = mask_document(
    input_path=str(input_file),
    policy=custom_policy,
    output_format="docling",
    cloakmap_path=str(temp_dir / "custom_mask.cloakmap.json")
)

print("✅ Custom masking completed!")
print(f"Entities processed: {custom_result.stats.total_entities_found}")

# Load and display the masked content
with open(custom_result.masked_path) as f:
    custom_masked_doc = json.load(f)

print("\n📄 Custom masked document content:")
for text_item in custom_masked_doc["texts"]:
    print(f"- {text_item['text']}")

print("\n🔍 Comparison with default masking:")
print("\nDefault policy result:")
for text_item in masked_doc["texts"]:
    print(f"- {text_item['text']}")

print("\nCustom policy result:")
for text_item in custom_masked_doc["texts"]:
    print(f"- {text_item['text']}")

## Unmasking Documents

The key feature of CloakPivot is reversibility: given a masked document and its CloakMap, you can restore the original content exactly.

In [None]:
# Unmask the document using the CloakMap
unmasked_result = unmask_document(
    masked_path=custom_result.masked_path,
    cloakmap_path=custom_result.cloakmap_path,
    output_format="docling"
)

print("✅ Unmasking completed!")
print(f"Restored document: {unmasked_result.restored_path}")

# Load and verify the restored content
with open(unmasked_result.restored_path) as f:
    restored_doc = json.load(f)

print("\n📄 Restored document content:")
for text_item in restored_doc["texts"]:
    print(f"- {text_item['text']}")

# Verify round-trip accuracy
original_texts = [item['text'] for item in sample_document['texts']]
restored_texts = [item['text'] for item in restored_doc['texts']]

print("\n🔍 Round-trip verification:")
if original_texts == restored_texts:
    print("✅ Perfect round-trip! Original and restored content match exactly.")
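else:
    print("❌ Round-trip failed - content differs.")
    # zip() stops at the shorter list, so report a length mismatch explicitly
    if len(original_texts) != len(restored_texts):
        print(f"  Text count differs: {len(original_texts)} original vs {len(restored_texts)} restored")
    for i, (orig, rest) in enumerate(zip(original_texts, restored_texts)):
        if orig != rest:
            print(f"  Text {i+1}:")
            print(f"    Original: {orig}")
            print(f"    Restored: {rest}")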

## Exploring Different Strategies

Let's demonstrate the different masking strategies available in CloakPivot.
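
To make the four strategy kinds concrete before running them through the library, here is a rough pure-Python sketch of what each one does to a single value. These helpers are hypothetical stand-ins, not CloakPivot functions; the `mask_char` parameter is an assumption, `format_aware` is not modeled, and the library's actual output may differ.

```python
import hashlib


def redact(value, redact_char="*"):
    """Replace every character, preserving the original length."""
    return redact_char * len(value)


def template(value, template_text):
    """Replace the whole value with a fixed placeholder."""
    return template_text


def partial(value, visible_chars, position="start", mask_char="*"):
    """Keep a few characters at the start or end and hide the rest."""
    visible_chars = min(visible_chars, len(value))
    hidden = mask_char * (len(value) - visible_chars)
    if position == "start":
        return value[:visible_chars] + hidden
    return hidden + value[len(value) - visible_chars:]


def hash_value(value, algorithm="sha256", truncate=8):
    """Replace the value with a truncated hex digest (same input, same output)."""
    digest = hashlib.new(algorithm, value.encode("utf-8")).hexdigest()
    return digest[:truncate]


samples = {
    "redact": redact("987-65-4321", "█"),
    "template": template("Sarah Wilson", "[DOCTOR]"),
    "partial (start)": partial("sarah.wilson@hospital.org", 3, "start"),
    "partial (end)": partial("(555) 123-4567", 4, "end"),
    "hash": hash_value("4532-1234-5678-9012"),
}
for name, masked_value in samples.items():
    print(f"{name:16} {masked_value}")
```

Redaction and partial masking keep the value's length, templates keep only its type, and hashing keeps only the ability to tell whether two values were equal.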

In [None]:
# Sample text with various PII types
sample_text = {
    "name": "Strategy Demonstration",
    "texts": [{
        "text": "Contact Dr. Sarah Wilson at sarah.wilson@hospital.org or (555) 123-4567. Patient SSN: 987-65-4321. Credit Card: 4532-1234-5678-9012.",
        "type": "text",
        "node_id": "demo_001"
    }],
    "metadata": {"demo": True}
}

# Create different policies to demonstrate each strategy
strategies_demo = {
    "Redaction": MaskingPolicy(
        default_strategy=Strategy(kind=StrategyKind.REDACT,
                                 parameters={"redact_char": "█"})
    ),

    "Template": MaskingPolicy(
        per_entity={
            "PERSON": Strategy(kind=StrategyKind.TEMPLATE,
                              parameters={"template": "[DOCTOR]"}),
            "EMAIL_ADDRESS": Strategy(kind=StrategyKind.TEMPLATE,
                                     parameters={"template": "[EMAIL]"}),
            "PHONE_NUMBER": Strategy(kind=StrategyKind.TEMPLATE,
                                    parameters={"template": "[PHONE]"}),
            "US_SSN": Strategy(kind=StrategyKind.TEMPLATE,
                              parameters={"template": "[SSN]"}),
            "CREDIT_CARD": Strategy(kind=StrategyKind.TEMPLATE,
                                   parameters={"template": "[CARD]"})
        }
    ),

    "Partial": MaskingPolicy(
        default_strategy=Strategy(kind=StrategyKind.PARTIAL,
                                 parameters={"visible_chars": 3, "position": "start"})
    ),

    "Hash": MaskingPolicy(
        default_strategy=Strategy(kind=StrategyKind.HASH,
                                 parameters={"algorithm": "sha256", "truncate": 8})
    )
}

print("🎭 Demonstrating Different Masking Strategies\n")

for strategy_name, policy in strategies_demo.items():
    # Create sample file
    sample_file = temp_dir / f"sample_{strategy_name.lower()}.docling.json"
    with open(sample_file, 'w') as f:
        json.dump(sample_text, f)

    # Apply strategy (use a separate name so the earlier `result` is not overwritten)
    strategy_result = mask_document(
        input_path=str(sample_file),
        policy=policy,
        output_format="docling"
    )

    # Load result
    with open(strategy_result.masked_path) as f:
        masked = json.load(f)

    print(f"📝 {strategy_name} Strategy:")
    print(f"   Original: {sample_text['texts'][0]['text']}")
    print(f"   Masked:   {masked['texts'][0]['text']}")
    print()

## Working with Policy Files

In practice, you'll often want to define policies in YAML files for reusability and version control.
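
Because a policy file is edited by hand, a quick sanity check on the parsed result can catch typos early. The sketch below works on a plain dictionary of the shape a YAML loader would return for the file in the next cell; no YAML library is used here. `check_policy` and `KNOWN_KINDS` are hypothetical helpers, the list of kinds is limited to those used in this notebook, and the 0 to 1 threshold range is an assumption.

```python
KNOWN_KINDS = {"redact", "template", "partial", "hash"}  # kinds used in this notebook


def check_policy(policy):
    """Return a list of human-readable problems found in a parsed policy dict."""
    problems = []

    default = policy.get("default_strategy")
    if default is not None and default.get("kind") not in KNOWN_KINDS:
        problems.append(f"default_strategy: unknown kind {default.get('kind')!r}")

    for entity_type, spec in policy.get("per_entity", {}).items():
        if spec.get("kind") not in KNOWN_KINDS:
            problems.append(f"{entity_type}: unknown kind {spec.get('kind')!r}")
        threshold = spec.get("threshold")
        if threshold is not None:
            # Assumption: thresholds are confidence scores between 0 and 1.
            if not isinstance(threshold, (int, float)) or not 0.0 <= threshold <= 1.0:
                problems.append(f"{entity_type}: threshold {threshold!r} is not in [0, 1]")

    return problems


parsed_policy = {
    "default_strategy": {"kind": "redact", "parameters": {"redact_char": "*"}},
    "per_entity": {
        "PERSON": {"kind": "templte", "threshold": 0.8},   # typo in kind
        "US_SSN": {"kind": "template", "threshold": 95},   # 95 instead of 0.95
    },
}

for problem in check_policy(parsed_policy):
    print("-", problem)
```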

In [None]:
# Create a YAML policy file
yaml_policy = """
version: "1.0"
name: "healthcare-demo"
description: "Healthcare demonstration policy"

locale: "en"
seed: "healthcare-demo-seed"

default_strategy:
  kind: "redact"
  parameters:
    redact_char: "*"
    preserve_length: true

per_entity:
  PERSON:
    kind: "template"
    parameters:
      template: "[PATIENT]"
    threshold: 0.8
    enabled: true
    
  EMAIL_ADDRESS:
    kind: "partial"
    parameters:
      visible_chars: 2
      position: "start"
      format_aware: true
    threshold: 0.7
    enabled: true
    
  US_SSN:
    kind: "template"
    parameters:
      template: "XXX-XX-XXXX"
    threshold: 0.95
    enabled: true

context_rules:
  heading:
    enabled: false
  table:
    enabled: true

allow_list:
  - "Emergency Contact"
  - "Hospital Administration"

deny_list:
  - "confidential"
  - "classified"
"""

# Save policy to file
policy_file = temp_dir / "healthcare_policy.yaml"
with open(policy_file, 'w') as f:
    f.write(yaml_policy)

print(f"✅ Policy file created: {policy_file}")

# Use the policy file with mask_document
yaml_result = mask_document(
    input_path=str(input_file),
    policy_path=str(policy_file),  # Use policy file instead of policy object
    output_format="docling"
)

print("✅ Masking with YAML policy completed!")

# Display result
with open(yaml_result.masked_path) as f:
    yaml_masked = json.load(f)

print("\n📄 YAML policy result:")
for text_item in yaml_masked["texts"]:
    print(f"- {text_item['text']}")

## Summary

In this notebook, we've covered:

✅ **Basic masking operations** with default and custom policies  
✅ **CloakMap structure** and how it enables reversible masking  
✅ **Different masking strategies** (redact, template, partial, hash)  
✅ **Perfect round-trip restoration** using unmasking  
✅ **Policy configuration** both programmatically and with YAML files  

## Next Steps

- **Explore advanced policies**: Learn about context rules, allow/deny lists, and inheritance
- **Batch processing**: Process multiple documents efficiently
- **Security features**: Encrypt CloakMaps and manage keys
- **Integration**: Use CloakPivot in your data processing pipelines
- **Format support**: Work with different document formats and converters

## Resources

- [Policy Development Guide](../policies/creating_policies.rst)
- [CLI Usage Examples](../cli/overview.rst)
- [Advanced Features Notebook](02_advanced_features.ipynb)
- [Industry Use Cases](../examples/industry_scenarios.rst)

In [None]:
# Cleanup temporary files
import shutil

shutil.rmtree(temp_dir)
print(f"🧹 Cleaned up temporary directory: {temp_dir}")