# PII Detection Experiments with Hugging Face Dataset

This notebook explores the [ai4privacy/pii-masking-200k](https://huggingface.co/datasets/ai4privacy/pii-masking-200k) dataset and tests how well an OpenAI model (`gpt-4o-mini`) detects personally identifiable information (PII) in a small random sample of it.

## Setup and Imports

In [7]:
# Install required packages if needed
# !pip install datasets openai python-dotenv

In [11]:
import os
import random
from pathlib import Path
from datasets import load_dataset
from openai import OpenAI
import json
from dotenv import load_dotenv

# Load environment variables from .env file in parent directory (project root)
env_file = Path(__file__).parent.parent / '.env' if '__file__' in globals() else Path.cwd().parent / '.env'

print(f"Looking for .env file at: {env_file}")
if env_file.exists():
    print("✓ Found .env file")
    load_dotenv(dotenv_path=env_file)
else:
    print(f"⚠️  .env file not found at {env_file}")
    print("   Will try loading from default locations")
    load_dotenv()

Looking for .env file at: /Users/kodlan/workspaces/python/pii-llm/.env
✓ Found .env file
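
The cell above assumes the notebook sits exactly one directory below the project root. A minimal sketch of a more tolerant lookup walks up the parent directories until it finds a `.env` file. The helper name `find_env_file` is hypothetical and is not part of python-dotenv.

```python
from pathlib import Path
from typing import Optional


def find_env_file(start: Path, filename: str = ".env") -> Optional[Path]:
    """Return the nearest `filename` in `start` or any of its parents, else None."""
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    return None
```

With this helper, `find_env_file(Path.cwd())` could replace the hard-coded `parent.parent` path, and the result (if not `None`) can be passed to `load_dotenv(dotenv_path=...)`.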


## Load the Dataset

Loading the PII masking dataset from Hugging Face.

In [12]:
print("Loading dataset...")

# Get Hugging Face token from environment for faster downloads (optional)
hf_token = os.getenv("HF_TOKEN")

if hf_token:
    # Check if it's a placeholder value
    if hf_token.startswith("your-") or "placeholder" in hf_token.lower() or "example" in hf_token.lower():
        print("⚠️  HF_TOKEN appears to be a placeholder value - will use unauthenticated access")
        print("   Edit your .env file and add your actual Hugging Face token for faster downloads")
        hf_token = None
    else:
        print("✓ Using HF_TOKEN for authenticated access (faster downloads)")
else:
    print("⚠️  No HF_TOKEN found - using unauthenticated access (slower)")
    print("   Get a token at https://huggingface.co/settings/tokens to speed up downloads")

# Load dataset with token if available
dataset = load_dataset("ai4privacy/pii-masking-200k", token=hf_token if hf_token else None)
train_data = dataset["train"]

print(f"\nDataset loaded successfully!")
print(f"Total number of samples: {len(train_data)}")
print(f"\nColumn names: {train_data.column_names}")

Loading dataset...
✓ Using HF_TOKEN for authenticated access (faster downloads)

Dataset loaded successfully!
Total number of samples: 209261

Column names: ['source_text', 'target_text', 'privacy_mask', 'span_labels', 'mbert_text_tokens', 'mbert_bio_labels', 'id', 'language', 'set']


## Sample Random Entries

Filtering the dataset to English samples only, then selecting 5 random samples with a fixed seed for reproducibility.

In [13]:
# Set random seed for reproducibility
random.seed(42)

# Filter dataset to English only
print("Filtering dataset to English samples only...")
english_data = train_data.filter(lambda x: x['language'] == 'en')
print(f"English samples: {len(english_data)} out of {len(train_data)} total")

# Sample 5 random indices from English data
sample_indices = random.sample(range(len(english_data)), 5)
print(f"\nSelected indices from English dataset: {sample_indices}")

# Get the samples
samples = [english_data[i] for i in sample_indices]
print(f"Successfully sampled {len(samples)} English entries")

# Verify all samples are English
languages = [s['language'] for s in samples]
print(f"Languages: {languages}")
assert all(lang == 'en' for lang in languages), "Not all samples are English!"

Filtering dataset to English samples only...
English samples: 43501 out of 209261 total

Selected indices from English dataset: [41905, 7296, 1639, 18024, 16049]
Successfully sampled 5 English entries
Languages: ['en', 'en', 'en', 'en', 'en']
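
`random.seed(42)` reseeds the module-level generator, so any other code that draws random numbers before the sampling call changes which indices are selected. A sketch of an alternative, using only the standard library, keeps the seed in a private `random.Random` instance. The function name `sample_indices_with_seed` is illustrative.

```python
import random
from typing import List


def sample_indices_with_seed(population_size: int, k: int, seed: int = 42) -> List[int]:
    """Draw k distinct indices from range(population_size) using a private RNG."""
    rng = random.Random(seed)  # independent of the module-level generator
    return rng.sample(range(population_size), k)


first = sample_indices_with_seed(43501, 5)
random.random()  # unrelated global draw; does not affect the private RNG
second = sample_indices_with_seed(43501, 5)
print(first == second)  # True: same seed, same indices
```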


## Examine the Samples

Let's look at each sample to understand the data structure.

In [14]:
for i, sample in enumerate(samples, 1):
    print(f"\n{'='*80}")
    print(f"SAMPLE {i} (ID: {sample['id']}, Language: {sample['language']})")
    print(f"{'='*80}")
    
    print("\n📄 SOURCE TEXT (with PII):")
    print(sample['source_text'])

    print("\n🔒 TARGET TEXT (masked):")
    print(sample['target_text'])

    print("\n🏷️  GROUND TRUTH PII:")
    for pii_item in sample['privacy_mask']:
        print(f"  - {pii_item['label']:20s} : '{pii_item['value']}' (pos {pii_item['start']}-{pii_item['end']})")


SAMPLE 1 (ID: 207666, Language: en)

📄 SOURCE TEXT (with PII):
Analyst, we need you to investigate further about the Mozilla/5.0 (Windows NT 5.1; WOW64; rv:12.2) Gecko/20100101 Firefox/12.2.4 case. Draw insights from this situation and propose how it can affect our company policies.

🔒 TARGET TEXT (masked):
[JOBTYPE], we need you to investigate further about the [USERAGENT] case. Draw insights from this situation and propose how it can affect our company policies.

🏷️  GROUND TRUTH PII:
  - JOBTYPE              : 'Analyst' (pos 0-7)
  - USERAGENT            : 'Mozilla/5.0 (Windows NT 5.1; WOW64; rv:12.2) Gecko/20100101 Firefox/12.2.4' (pos 54-128)

SAMPLE 2 (ID: 173057, Language: en)

📄 SOURCE TEXT (with PII):
Our main blockchain consultant identified 13eBdNVhaUZ6RPaCbiD1jPWVnEKaRCUU and 0xffff97e7cbbdc819da9d6ae8ef80deaf4bf39cec as two significant cryptocurrency addresses to monitor considering their possible impact on our business continuity.

🔒 TARGET TEXT (masked)
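
Each `privacy_mask` entry carries `start` and `end` offsets that, judging by Sample 1, are zero-based with an exclusive end, so `source_text[start:end]` should reproduce `value`. The sketch below checks that assumption. `find_misaligned_spans` is a hypothetical helper, and the sample dictionary is copied from the Sample 1 output above.

```python
from typing import Any, Dict, List


def find_misaligned_spans(sample: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the privacy_mask entries whose offsets do not slice out their value."""
    text = sample["source_text"]
    return [
        item
        for item in sample["privacy_mask"]
        if text[item["start"]:item["end"]] != item["value"]
    ]


# Data copied from the Sample 1 output (ID 207666).
sample_1 = {
    "source_text": (
        "Analyst, we need you to investigate further about the "
        "Mozilla/5.0 (Windows NT 5.1; WOW64; rv:12.2) Gecko/20100101 Firefox/12.2.4 "
        "case. Draw insights from this situation and propose how it can affect "
        "our company policies."
    ),
    "privacy_mask": [
        {"label": "JOBTYPE", "value": "Analyst", "start": 0, "end": 7},
        {
            "label": "USERAGENT",
            "value": "Mozilla/5.0 (Windows NT 5.1; WOW64; rv:12.2) Gecko/20100101 Firefox/12.2.4",
            "start": 54,
            "end": 128,
        },
    ],
}

print(find_misaligned_spans(sample_1))  # [] means every span lines up
```

The same slicing check can be applied to model-reported offsets, which is a quick way to see whether they can be trusted for span-level scoring.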

## Setup OpenAI API

Configure the OpenAI client using the API key from environment variables.

In [15]:
# Get API key from environment variable
api_key = os.getenv("OPENAI_API_KEY")

# Validate API key
if not api_key:
    raise ValueError("OPENAI_API_KEY not found in environment variables.\n\n")

# Check if it's still the placeholder value from .env.example
if api_key.startswith("your-") or "placeholder" in api_key.lower() or "example" in api_key.lower():
    raise ValueError("OPENAI_API_KEY appears to be the placeholder value from .env.example.\n\n")

# Initialize OpenAI client
client = OpenAI(api_key=api_key)
print("✓ OpenAI client initialized successfully")

✓ OpenAI client initialized successfully
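
The same placeholder test appears in both the `HF_TOKEN` and `OPENAI_API_KEY` checks. One possible refactoring pulls it into a small helper. The name `looks_like_placeholder` is hypothetical, and the markers mirror the ones used in the cells above.

```python
def looks_like_placeholder(value: str) -> bool:
    """Heuristic: True if a secret looks like an unedited template value."""
    lowered = value.lower()
    return value.startswith("your-") or "placeholder" in lowered or "example" in lowered


print(looks_like_placeholder("your-api-key-here"))   # True
print(looks_like_placeholder("not-a-template-123"))  # False
```

This is only a heuristic: a genuine credential that happens to contain the substring "example" would also be rejected.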


## Define PII Detection Prompt

Create a prompt that asks the LLM to identify all PII in the text.

In [16]:
def create_pii_detection_prompt(text):
    """Create a prompt for PII detection."""
    return f"""Identify all personally identifiable information (PII) in the following text.
               For each piece of PII found, provide:
               1. The exact text/value
               2. The type of PII (e.g., FIRSTNAME, EMAIL, PHONENUMBER, ADDRESS, SSN, etc.)
               3. The character position (start and end)

               Return your response as a JSON array where each item has: "value", "label", "start", "end"

               Text to analyze: {text}
               Only return the JSON array, no additional text."""

# Test the prompt with the first sample
test_prompt = create_pii_detection_prompt(samples[0]['source_text'])
print("Example prompt:")
print(test_prompt[:500] + "..." if len(test_prompt) > 500 else test_prompt)

Example prompt:
Identify all personally identifiable information (PII) in the following text.
               For each piece of PII found, provide:
               1. The exact text/value
               2. The type of PII (e.g., FIRSTNAME, EMAIL, PHONENUMBER, ADDRESS, SSN, etc.)
               3. The character position (start and end)

               Return your response as a JSON array where each item has: "value", "label", "start", "end"

               Text to analyze: Analyst, we need you to investigate furth...
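
The prompt above asks for a JSON array, while the API call later in this notebook sets `response_format={"type": "json_object"}`, and the recorded responses are top-level objects (`{}` or `{"PII": [...]}`) rather than arrays. A sketch of a prompt variant that requests an object with one fixed key would make the two consistent. The function name `create_pii_detection_prompt_v2` and the key name `pii` are assumptions, not part of the original notebook, and this variant has not been run against the API here.

```python
import textwrap


def create_pii_detection_prompt_v2(text: str) -> str:
    """Build a prompt that requests a JSON object with one fixed key, "pii"."""
    # Plain string (not an f-string), so the literal braces need no escaping.
    instructions = textwrap.dedent(
        """\
        Identify all personally identifiable information (PII) in the following text.
        For each piece of PII found, provide:
        1. The exact text/value
        2. The type of PII (e.g., FIRSTNAME, EMAIL, PHONENUMBER, ADDRESS, SSN, etc.)
        3. The character position (start and end)

        Return a JSON object with a single key "pii" whose value is an array.
        Each array item has: "value", "label", "start", "end".
        If no PII is found, return {"pii": []}.
        """
    )
    return instructions + "\nText to analyze: " + text
```

Using `textwrap.dedent` also removes the leading indentation that the original triple-quoted string carries into the prompt.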


## Send Samples to OpenAI for PII Detection

Process each sample and get PII detection results from OpenAI.

In [17]:
results = []

for i, sample in enumerate(samples, 1):
    print(f"\n{'='*80}")
    print(f"Processing Sample {i}/{len(samples)}...")
    print(f"{'='*80}")
    
    # Create the prompt
    prompt = create_pii_detection_prompt(sample['source_text'])
    
    # Call OpenAI API
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # or "gpt-4" for better accuracy
            messages=[
                {"role": "system", "content": "You are a PII detection expert. Return only valid JSON."},
                {"role": "user", "content": prompt}
            ],
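            temperature=0,  # minimize sampling randomness (output may still vary slightly)
            response_format={"type": "json_object"}  # JSON mode returns a JSON object, so the array the prompt asks for comes back wrapped under a key (or as {})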
        )
        
        # Extract the response
        llm_response = response.choices[0].message.content
        print("\n🤖 OpenAI Response:")
        print(llm_response)
        
        # Try to parse as JSON
        try:
            detected_pii = json.loads(llm_response)
        except json.JSONDecodeError:
            print("⚠️  Warning: Could not parse response as JSON")
            detected_pii = {"error": "Invalid JSON response"}
        
        # Store result
        results.append({
            "sample_id": sample['id'],
            "source_text": sample['source_text'],
            "ground_truth": sample['privacy_mask'],
            "detected_pii": detected_pii
        })
        
    except Exception as e:
        print(f"❌ Error calling OpenAI API: {e}")
        results.append({
            "sample_id": sample['id'],
            "source_text": sample['source_text'],
            "ground_truth": sample['privacy_mask'],
            "error": str(e)
        })



Processing Sample 1/5...

🤖 OpenAI Response:
{}

Processing Sample 2/5...

🤖 OpenAI Response:
{
  "PII": []
}

Processing Sample 3/5...

🤖 OpenAI Response:
{
  "PII": []
}

Processing Sample 4/5...

🤖 OpenAI Response:
{
  "PII": [
    {
      "value": "Adella",
      "label": "FIRSTNAME",
      "start": 6,
      "end": 12
    },
    {
      "value": "29-977557-300592-0",
      "label": "SSN",
      "start": 56,
      "end": 75
    }
  ]
}

Processing Sample 5/5...

🤖 OpenAI Response:
{
  "PII": []
}


✓ Processed all 5 samples
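
The responses above arrive in more than one shape: an empty object for Sample 1 and an object with a `PII` key for the others. Before comparing against `privacy_mask`, it helps to normalize them into a flat list. The sketch below does that and then scores detections by exact value match. The helper names, the case-insensitive key lookup, and value-only matching are all assumptions made for illustration, not the notebook's evaluation method. Model-reported offsets are ignored here because they may not line up with the text (in the Sample 4 response, `end - start` is 19 for an 18-character value).

```python
import json
from typing import Any, Dict, List


def normalize_detections(parsed: Any) -> List[Dict[str, Any]]:
    """Flatten a parsed model response into a list of detection dicts."""
    if isinstance(parsed, list):
        items = parsed
    elif isinstance(parsed, dict):
        # Accept "PII", "pii", etc.; an empty object yields no detections.
        items = next(
            (v for k, v in parsed.items() if k.lower() == "pii" and isinstance(v, list)),
            [],
        )
    else:
        items = []
    return [item for item in items if isinstance(item, dict) and "value" in item]


def score_by_value(detected: List[Dict[str, Any]], truth: List[Dict[str, Any]]) -> Dict[str, float]:
    """Precision and recall over the sets of exact PII values (labels ignored)."""
    detected_values = {d["value"] for d in detected}
    truth_values = {t["value"] for t in truth}
    hits = len(detected_values & truth_values)
    return {
        "precision": hits / len(detected_values) if detected_values else 0.0,
        "recall": hits / len(truth_values) if truth_values else 0.0,
    }


# Ground truth copied from the Sample 1 output; the model returned {} for it.
truth_1 = [
    {"label": "JOBTYPE", "value": "Analyst"},
    {
        "label": "USERAGENT",
        "value": "Mozilla/5.0 (Windows NT 5.1; WOW64; rv:12.2) Gecko/20100101 Firefox/12.2.4",
    },
]
detected_1 = normalize_detections(json.loads("{}"))
print(score_by_value(detected_1, truth_1))  # {'precision': 0.0, 'recall': 0.0}
```

Value-only matching is deliberately lenient about labels, since the model's label vocabulary (for example `SSN`) need not match the dataset's. A stricter comparison would also map labels onto the dataset's categories.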
