# PII Detection Experiments with Hugging Face Dataset

This notebook explores the [ai4privacy/pii-masking-200k](https://huggingface.co/datasets/ai4privacy/pii-masking-200k) dataset and tests how well OpenAI models detect personally identifiable information (PII).

## Setup and Imports

In [3]:
# Install required packages if needed
# !pip install datasets openai python-dotenv

In [26]:
import os
import random
from pathlib import Path
from datasets import load_dataset
from openai import OpenAI
import json
from dotenv import load_dotenv

# Load environment variables from .env file in parent directory (project root)
env_file = Path(__file__).parent.parent / '.env' if '__file__' in globals() else Path.cwd().parent / '.env'

print(f"Looking for .env file at: {env_file}")
if env_file.exists():
    print("✓ Found .env file")
    load_dotenv(dotenv_path=env_file)
else:
    print(f"⚠️  .env file not found at {env_file}")
    print("   Will try loading from default locations")
    load_dotenv()

Looking for .env file at: /Users/kodlan/workspaces/python/pii-llm/.env
✓ Found .env file


## Load the Dataset

Loading the PII masking dataset from Hugging Face.

In [27]:
print("Loading dataset...")

# Get Hugging Face token from environment for faster downloads (optional)
hf_token = os.getenv("HF_TOKEN")

if hf_token:
    # Check if it's a placeholder value
    if hf_token.startswith("your-") or "placeholder" in hf_token.lower() or "example" in hf_token.lower():
        print("⚠️  HF_TOKEN appears to be a placeholder value - will use unauthenticated access")
        print("   Edit your .env file and add your actual Hugging Face token for faster downloads")
        hf_token = None
    else:
        print("✓ Using HF_TOKEN for authenticated access (faster downloads)")
else:
    print("⚠️  No HF_TOKEN found - using unauthenticated access (slower)")
    print("   Get a token at https://huggingface.co/settings/tokens to speed up downloads")

# Load dataset with token if available
dataset = load_dataset("ai4privacy/pii-masking-200k", token=hf_token if hf_token else None)
train_data = dataset["train"]

print(f"\nDataset loaded successfully!")
print(f"Total number of samples: {len(train_data)}")
print(f"\nColumn names: {train_data.column_names}")

Loading dataset...
✓ Using HF_TOKEN for authenticated access (faster downloads)

Dataset loaded successfully!
Total number of samples: 209261

Column names: ['source_text', 'target_text', 'privacy_mask', 'span_labels', 'mbert_text_tokens', 'mbert_bio_labels', 'id', 'language', 'set']


## Sample Random Entries

Filter the dataset to English samples only, then select 10 random samples with a fixed seed for reproducibility.

In [28]:
# Set random seed for reproducibility
random.seed(42)

# Set number of samples
number_samples = 10

# Filter dataset to English only
print("Filtering dataset to English samples only...")
english_data = train_data.filter(lambda x: x['language'] == 'en')
print(f"English samples: {len(english_data)} out of {len(train_data)} total")

# Sample random indices from English data
sample_indices = random.sample(range(len(english_data)), number_samples)
print(f"\nSelected indices from English dataset: {sample_indices}")

# Get the samples
samples = [english_data[i] for i in sample_indices]
print(f"Successfully sampled {len(samples)} English entries")

# Verify all samples are English
languages = [s['language'] for s in samples]
print(f"Languages: {languages}")
assert all(lang == 'en' for lang in languages), "Not all samples are English!"

Filtering dataset to English samples only...
English samples: 43501 out of 209261 total

Selected indices from English dataset: [41905, 7296, 1639, 18024, 16049, 14628, 9144, 6717, 35741, 5697]
Successfully sampled 10 English entries
Languages: ['en', 'en', 'en', 'en', 'en', 'en', 'en', 'en', 'en', 'en']
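
Seeding the global `random` module works, but any other code that draws random numbers before this cell would change the selection. A minimal sketch of an alternative, assuming a hypothetical helper `pick_indices` that is not part of the notebook, uses a private `random.Random` instance so the draw depends only on the seed and the population size:

```python
import random


def pick_indices(population_size, k, seed=42):
    """Return k distinct indices in [0, population_size) using a local RNG.

    `pick_indices` is a hypothetical helper introduced for illustration.
    """
    rng = random.Random(seed)  # private generator; global random state is untouched
    return rng.sample(range(population_size), k)


# The same seed and population size give the same indices on every call.
first = pick_indices(43501, 10)
second = pick_indices(43501, 10)
print(first == second)
```

The helper could replace the `random.seed` / `random.sample` pair if the notebook grows more cells that use randomness.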


## Examine the Samples

Let's look at each sample to understand the data structure.

In [29]:
for i, sample in enumerate(samples):
    print(f"\n{'='*80}")
    print(f"SAMPLE {i} (ID: {sample['id']}, Language: {sample['language']})")
    print(f"{'='*80}")
    
    print(f"\n📄 SOURCE TEXT (with PII):")
    print(f"{sample['source_text']}")
    
    print(f"\n🔒 TARGET TEXT (masked):")
    print(f"{sample['target_text']}")
    
    print(f"\n🏷️  GROUND TRUTH PII:")
    for pii_item in sample['privacy_mask']:
        print(f"  - {pii_item['label']:20s} : '{pii_item['value']}' (pos {pii_item['start']}-{pii_item['end']})")


SAMPLE 0 (ID: 207666, Language: en)

📄 SOURCE TEXT (with PII):
Analyst, we need you to investigate further about the Mozilla/5.0 (Windows NT 5.1; WOW64; rv:12.2) Gecko/20100101 Firefox/12.2.4 case. Draw insights from this situation and propose how it can affect our company policies.

🔒 TARGET TEXT (masked):
[JOBTYPE], we need you to investigate further about the [USERAGENT] case. Draw insights from this situation and propose how it can affect our company policies.

🏷️  GROUND TRUTH PII:
  - JOBTYPE              : 'Analyst' (pos 0-7)
  - USERAGENT            : 'Mozilla/5.0 (Windows NT 5.1; WOW64; rv:12.2) Gecko/20100101 Firefox/12.2.4' (pos 54-128)

SAMPLE 1 (ID: 173057, Language: en)

📄 SOURCE TEXT (with PII):
Our main blockchain consultant identified 13eBdNVhaUZ6RPaCbiD1jPWVnEKaRCUU and 0xffff97e7cbbdc819da9d6ae8ef80deaf4bf39cec as two significant cryptocurrency addresses to monitor considering their possible impact on our business continuity.

🔒 TARGET TEXT (masked)
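
Each `privacy_mask` entry carries `start` and `end` offsets into `source_text`. The sketch below, built around a hypothetical helper `find_span_mismatches`, checks that slicing the text with those offsets reproduces the recorded `value`. It assumes `end` is exclusive, which the `'Analyst' (pos 0-7)` entry suggests but the notebook does not state.

```python
def find_span_mismatches(source_text, privacy_mask):
    """Return the mask entries whose offsets do not reproduce their value.

    Assumption: `end` is an exclusive offset (Python slice semantics).
    """
    mismatches = []
    for item in privacy_mask:
        extracted = source_text[item["start"]:item["end"]]
        if extracted != item["value"]:
            # Record what the offsets actually point at, to ease debugging.
            mismatches.append({**item, "extracted": extracted})
    return mismatches


# Values copied from sample 0 in the output above (text shortened after "case.").
text = ("Analyst, we need you to investigate further about the "
        "Mozilla/5.0 (Windows NT 5.1; WOW64; rv:12.2) Gecko/20100101 "
        "Firefox/12.2.4 case.")
mask = [
    {"label": "JOBTYPE", "value": "Analyst", "start": 0, "end": 7},
    {"label": "USERAGENT",
     "value": "Mozilla/5.0 (Windows NT 5.1; WOW64; rv:12.2) Gecko/20100101 Firefox/12.2.4",
     "start": 54, "end": 128},
]
print(find_span_mismatches(text, mask))
```

Running the same check over all sampled entries would confirm whether the offsets can be trusted before they are used for scoring.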

## Setup OpenAI API

Configure the OpenAI client using the API key from environment variables.

In [30]:
# Get API key from environment variable
api_key = os.getenv("OPENAI_API_KEY")

# Validate API key
if not api_key:
    raise ValueError("OPENAI_API_KEY not found in environment variables.")

# Check if it's still the placeholder value from .env.example
if api_key.startswith("your-") or "placeholder" in api_key.lower() or "example" in api_key.lower():
    raise ValueError("OPENAI_API_KEY appears to be the placeholder value from .env.example.")

# Initialize OpenAI client
client = OpenAI(api_key=api_key)
print("✓ OpenAI client initialized successfully")

✓ OpenAI client initialized successfully
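
The placeholder check used for `OPENAI_API_KEY` is the same heuristic applied to `HF_TOKEN` earlier. As a sketch only, the two could share one helper. The names `looks_like_placeholder` and `read_secret` are hypothetical and do not appear in the notebook.

```python
import os

# Marker words the notebook treats as signs of an unedited .env.example value.
PLACEHOLDER_MARKERS = ("placeholder", "example")


def looks_like_placeholder(value):
    """Heuristic from the notebook: a 'your-' prefix or a marker word anywhere."""
    lowered = value.lower()
    return value.startswith("your-") or any(m in lowered for m in PLACEHOLDER_MARKERS)


def read_secret(name, required=False):
    """Return the environment variable `name`, or None if missing or a placeholder.

    With required=True, raise ValueError instead of returning None.
    """
    value = os.getenv(name)
    if value and not looks_like_placeholder(value):
        return value
    if required:
        raise ValueError(f"{name} is missing or still set to a placeholder value.")
    return None
```

With such a helper, the optional token could be read with `read_secret("HF_TOKEN")` and the mandatory key with `read_secret("OPENAI_API_KEY", required=True)`.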


## Define PII Detection Prompt

Create a prompt that asks the LLM to identify all PII in the text.

In [31]:
def gety_pii_prompt_simplest(text):
    """Create a prompt for PII detection."""
    return f"""For the text provided below identify all personally identifiable information (PII) in it.
               Return your response as a JSON array that contains all the portions of text identified as PII.
               Only return the JSON, no additional text. JSON array should be named pii, for example {{"pii": ["value1", "value2"]}}.
               Text to analyze: {text}
               """

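The prompt asks for a JSON object with a `pii` array, but a reply may still deviate from that shape (a missing key, a bare array, non-string items). A tolerant extractor could look like the sketch below. `extract_pii_values` is a hypothetical helper, and returning an empty list on malformed input is an assumption about the desired behaviour, not something the notebook specifies.

```python
import json


def extract_pii_values(raw_response):
    """Parse a model reply and return its `pii` entries as a list of strings.

    Returns an empty list when the reply is not valid JSON, is not a JSON
    object, or has no list under the `pii` key.
    """
    try:
        payload = json.loads(raw_response)
    except (json.JSONDecodeError, TypeError):
        return []
    if not isinstance(payload, dict):
        return []
    values = payload.get("pii", [])
    if not isinstance(values, list):
        return []
    # Keep strings, stringify numeric scalars, skip nested objects and nulls.
    return [str(v) for v in values if isinstance(v, (str, int, float))]


print(extract_pii_values('{"pii": ["Analyst", "Mozilla/5.0"]}'))
```
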

## Define the PII Detection Function

Define a helper that sends a single sample to a given OpenAI model and parses the JSON reply.

In [32]:
def detect_pii_with_model(sample, model_name, client):
    """
    Detect PII in a sample using a specific OpenAI model.
    
    Args:
        sample: Dataset sample containing 'source_text'
        model_name: OpenAI model name (e.g., 'gpt-4o-mini', 'gpt-5.2')
        client: OpenAI client instance
        
    Returns:
        dict: Result containing detected PII or error information
    """

    prompt = gety_pii_prompt_simplest(sample['source_text'])

    result = {
            "sample_id": sample['id'],
            "source_text": sample['source_text'],
            "ground_truth": sample['privacy_mask'],
            "model": model_name
    }

    try:
        response = client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "system", "content": "You are a PII detection expert. Return only valid JSON."},
                {"role": "user", "content": prompt}
            ],
            temperature=0,  # minimize sampling randomness (output is not guaranteed to be identical across runs)
            response_format={"type": "json_object"}
        )
        
        # Extract the response
        llm_response = response.choices[0].message.content
        
        detected_pii = json.loads(llm_response)

        result["detected_pii"] = detected_pii
        result["raw_response"] = llm_response

        return result
        
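    except json.JSONDecodeError:
        result["error"] = "Invalid JSON response"
        result["raw_response"] = llm_response  # keep the unparsed reply for debugging
        return result
    except Exception as e:
        result["error"] = str(e)
        result["raw_response"] = ""  # no usable reply when the API call itself fails
        return result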
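
The result dictionary holds both the model's answer and the ground truth, so the two can be compared directly. The sketch below scores one sample by exact string match. `score_detection` is a hypothetical helper, and exact matching is an assumption: a model that returns a partial or reformatted span would count as a miss.

```python
def score_detection(detected_values, privacy_mask):
    """Exact-match precision and recall of detected strings against ground truth."""
    expected = {item["value"] for item in privacy_mask}
    detected = set(detected_values)
    true_positives = expected & detected
    # Guard against division by zero when either side is empty.
    precision = len(true_positives) / len(detected) if detected else 0.0
    recall = len(true_positives) / len(expected) if expected else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "missed": sorted(expected - detected),
        "extra": sorted(detected - expected),
    }


# Illustrative call with made-up detections for a two-item ground truth.
ground_truth = [{"label": "JOBTYPE", "value": "Analyst"},
                {"label": "USERAGENT", "value": "Mozilla/5.0"}]
print(score_detection(["Analyst", "company"], ground_truth))
```

Applied to a result from `detect_pii_with_model`, the inputs would be `result["detected_pii"]["pii"]` and `result["ground_truth"]`.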


## Send Samples to OpenAI for PII Detection

Process each sample and get PII detection results from two OpenAI models:
- **gpt-4o-mini**: Cost-effective baseline model
- **gpt-5.2**: Latest and most capable model (as of Jan 2026)

In [33]:
models_to_test = [
    "gpt-4o-mini",  # Cost-effective baseline
    "gpt-5.2"       # Latest flagship model
]

for i, sample in enumerate(samples):
    print(f"\n{'='*80}")
    print(f"Processing Sample {i}/{len(samples)} (ID: {sample['id']})")
    print(f"{'='*80}")
    
    # Show sample text
    print(f"\n📄 Text: {sample['source_text']}")

    # Show expected PII (ground truth)
    print(f"\n🏷️  Expected PII ({len(sample['privacy_mask'])} items):")
    for pii_item in sample['privacy_mask']:
        print(f"   • {pii_item['label']:20s} : '{pii_item['value']}'")
    
    # Test with each model
    for model_name in models_to_test:
        print(f"\n🤖 Testing with {model_name}...")
        
        result = detect_pii_with_model(sample, model_name, client)

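        if 'error' in result:
            print(f"   ❌ Error: {result['error']}")
            # raw_response may be absent if the API call failed before a reply arrived
            raw = result.get('raw_response', '').strip().replace('\n', '')
            print(f"   ✓ Raw Response: {raw}")
        else:
            # Build the string outside the f-string so it also parses on Python < 3.12
            found = ", ".join(map(str, result['detected_pii'].get('pii', [])))
            print(f"   ✓ Response: {found}")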


Processing Sample 0/10 (ID: 207666)

📄 Text: Analyst, we need you to investigate further about the Mozilla/5.0 (Windows NT 5.1; WOW64; rv:12.2) Gecko/20100101 Firefox/12.2.4 case. Draw insights from this situation and propose how it can affect our company policies.

🏷️  Expected PII (2 items):
   • JOBTYPE              : 'Analyst'
   • USERAGENT            : 'Mozilla/5.0 (Windows NT 5.1; WOW64; rv:12.2) Gecko/20100101 Firefox/12.2.4'

🤖 Testing with gpt-4o-mini...
   ✓ Response: 

🤖 Testing with gpt-5.2...
   ✓ Response: 

Processing Sample 1/10 (ID: 173057)

📄 Text: Our main blockchain consultant identified 13eBdNVhaUZ6RPaCbiD1jPWVnEKaRCUU and 0xffff97e7cbbdc819da9d6ae8ef80deaf4bf39cec as two significant cryptocurrency addresses to monitor considering their possible impact on our business continuity.

🏷️  Expected PII (2 items):
   • BITCOINADDRESS       : '13eBdNVhaUZ6RPaCbiD1jPWVnEKaRCUU'
   • ETHEREUMADDRESS      : '0xffff97e7cbbdc819da9d6ae8ef80d