# Sensitive Data Detection Tutorial

This notebook demonstrates the complete workflow of the sensitive data detection system, including:
1. Data Processing with the data_processor module
2. Language Detection with the language_detection module  
3. PII and Non-PII Detection with the detect_reflect module
4. Free Text Analysis with the free_text module

## Setup

First, let's set up our environment and import the necessary modules. Before running the cell below, make sure you have created a `.env` file in the project root directory with your API keys:

```
OPENAI_API_KEY=your_openai_api_key_here
HUGGINGFACE_API_KEY=your_huggingface_api_key_here
```


In [1]:
import os
from dotenv import load_dotenv
import pandas as pd
from pathlib import Path
import json

# Load environment variables
load_dotenv()

# Add the project root to Python path
import sys
sys.path.append(str(Path.cwd().parent))

print("Environment setup complete!")
print(f"Current working directory: {Path.cwd()}")
print(f"Project root added to path: {Path.cwd().parent}")

MODEL_NAME = "gpt-4o-mini"

Environment setup complete!
Current working directory: /Users/liangtelkamp/Documents/GitHub/sensitive-data-detection/notebooks
Project root added to path: /Users/liangtelkamp/Documents/GitHub/sensitive-data-detection
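
Before making any API calls, it can help to confirm that `load_dotenv()` actually populated the expected variables. The following is a minimal sketch and not part of the project's modules: the helper name `missing_keys` is hypothetical, and it reports only variable names so that no secret value is printed.

```python
import os

def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Variable names taken from the .env template above
missing = missing_keys(["OPENAI_API_KEY", "HUGGINGFACE_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All expected API keys are set.")
```

Running this right after `load_dotenv()` surfaces a misplaced or empty `.env` file early, rather than as an authentication error in a later cell.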


## 1. Data Processing

Let's start by loading and processing our data using the `data_processor` module. We'll use the `dummy.csv` file in the `data` directory.


In [2]:
from modules.data_processor import DataLoader

# Initialize the data loader
data_loader = DataLoader()

# Load the dummy data
data_path = "../data/dummy.csv"
loaded_data = data_loader.load_data(data_path)

# Display the structure of loaded data
print("Loaded data structure:")
for table_name, table_data in loaded_data.items():
    print(f"\nTable: {table_name}")
    print(f"Columns: {list(table_data['columns'].keys())}")
    print(f"Metadata: {table_data.get('metadata', {})}")
    
    # Show the first 5 records from each of the first 3 columns
    for col_name, col_data in list(table_data['columns'].items())[:3]:
        print(f"  {col_name}: {col_data['records'][:5]}...")


Loaded data structure:

Table: dummy
Columns: ['report_date', 'location', 'access_level', 'service_coverage', 'population_total', 'facility_type', 'activity_type', 'region_name', 'access_constraints', 'group_vulnerability_map', 'incident_reports']
Metadata: {'country': None, 'country_info': {'raw_country': None, 'standardized_name': None, 'alpha_2': None, 'alpha_3': None, 'standardization_confidence': 0.0, 'extraction_method': 'not_found', 'extracted_from_filename': False}, 'filename': 'dummy.csv', 'filepath': '../data/dummy.csv', 'table_name': 'dummy', 'file_extension': '.csv', 'file_size_bytes': 1402, 'processing_timestamp': '2025-06-17T16:15:09.098812', 'total_columns': 11, 'max_records_per_column': 20, 'column_names': ['report_date', 'location', 'access_level', 'service_coverage', 'population_total', 'facility_type', 'activity_type', 'region_name', 'access_constraints', 'group_vulnerability_map', 'incident_reports'], 'column_types': {'report_date': 'object', 'location': 'object', '
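
Judging from the code and output above, `load_data` returns a dictionary keyed by table name, where each table holds a `columns` mapping (each column with a `records` list) and a `metadata` dictionary. The sketch below assumes only that shape. The helper `summarize_table` and the sample values are hypothetical and are not taken from `dummy.csv`.

```python
def summarize_table(table_data):
    """Map each column name to the number of records stored for it."""
    columns = table_data.get("columns", {})
    return {name: len(col.get("records", [])) for name, col in columns.items()}

# Hand-built stand-in that mimics the structure shown in the output above
example_table = {
    "columns": {
        "region_name": {"records": ["North", "South", "East"]},
        "population_total": {"records": [1200, 860, 430]},
    },
    "metadata": {"table_name": "example"},
}

print(summarize_table(example_table))
```

With real data, the same call would be `summarize_table(loaded_data["dummy"])`, provided the structure matches the assumption stated above.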

## 2. Language Detection

Now let's detect the language of the content in our data using the language detection module.


In [3]:
from modules.language_detection import LanguageDetector

# Initialize the language detector
lang_detector = LanguageDetector()

# Detect language for the file
try:
    detected_language = lang_detector.detect_language(data_path)
    print(f"Detected language for {data_path}: {detected_language}")
except Exception as e:
    print(f"Language detection failed: {e}")
    detected_language = "unknown"

# Store language info in our data structure, creating 'metadata' if it is missing
for table_name, table_data in loaded_data.items():
    table_data.setdefault('metadata', {})['detected_language'] = detected_language
    
print(f"\nLanguage detection complete. Detected language: {detected_language}")


Detected language for ../data/dummy.csv: en

Language detection complete. Detected language: en


## 3a. Set Up the LLM for Generation

In [4]:
from modules.llm_model.model import Model

# Initialize the model
model = Model(model_name=MODEL_NAME)

# Check if the model is ready
print(f"Model {model.model_name} is ready: {model.is_ready()}")

# Get model components
model_components = model.get_model_components()
print(f"Model components: {model_components}")

OpenAI client initialized
Model gpt-4o-mini is ready: True
Model components: (None, None, <openai.OpenAI object at 0x134049750>, 'openai')
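
The components tuple printed above has four slots, of which only the last two are populated for the OpenAI backend. The sketch below shows one way to unpack such a tuple. The names `local_model` and `tokenizer` are guesses, since the source does not document what the first two slots hold, and a plain `object()` stands in for the real client.

```python
# Stand-in for model.get_model_components(); the real call printed
# (None, None, <openai.OpenAI object ...>, 'openai') in the output above.
components = (None, None, object(), "openai")

# Hypothetical slot names: only the backend string and the client are evident from the output
local_model, tokenizer, client, backend = components

# With a hosted backend, no local model is loaded and requests go through the client
uses_hosted_api = local_model is None and client is not None
print(f"Backend: {backend}, hosted API client in use: {uses_hosted_api}")
```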


## 3. PII and Non-PII Detection with Detect-Reflect

Now we'll use the `detect_reflect` module to identify both PII and non-PII sensitive information. This requires a classifier backed by an LLM, which we set up first.


In [5]:
from modules.detect_reflect import SensitivityClassifier, detect_and_reflect_pii, detect_non_pii


# Initialize with real classifier
sensitivity_classifier = SensitivityClassifier(
    model_name=MODEL_NAME  # or "gpt-4" if you have access
)
print(f"Initialized SensitivityClassifier with model: {sensitivity_classifier.model_name}")

print("Classifier setup complete!")


OpenAI client initialized
Initialized SensitivityClassifier with model: gpt-4o-mini
Classifier setup complete!


### 3.1 PII Detection and Reflection


In [6]:
# Perform PII detection and reflection
print("Starting PII detection and reflection...")

# Result keys are suffixed with the classifier's model name
pii_detection_key = f"pii_detection_{sensitivity_classifier.model_name}"
pii_reflection_key = f"pii_reflection_{sensitivity_classifier.model_name}"

for table_name, table_data in loaded_data.items():
    print(f"\nProcessing table: {table_name}")
    
    # Apply PII detection and reflection, and store the result back in loaded_data
    table_data = detect_and_reflect_pii(table_data, sensitivity_classifier)
    loaded_data[table_name] = table_data
    
    # Display results
    print("PII Detection Results:")
    for col_name, col_data in table_data['columns'].items():
        pii_type = col_data.get(pii_detection_key, "Not analyzed")
        sensitivity = col_data.get(pii_reflection_key, "Not analyzed")
        
        print(f"  {col_name}:")
        print(f"    PII Type: {pii_type}")
        print(f"    Sensitivity: {sensitivity}")

print("\nPII detection and reflection complete!")


Starting PII detection and reflection...

Processing table: dummy
PII Detection Results:
  report_date:
    PII Type: None
    Sensitivity: NON_SENSITIVE
  location:
    PII Type: None
    Sensitivity: NON_SENSITIVE
  access_level:
    PII Type: None
    Sensitivity: NON_SENSITIVE
  service_coverage:
    PII Type: None
    Sensitivity: NON_SENSITIVE
  population_total:
    PII Type: None
    Sensitivity: NON_SENSITIVE
  facility_type:
    PII Type: None
    Sensitivity: NON_SENSITIVE
  activity_type:
    PII Type: None
    Sensitivity: NON_SENSITIVE
  region_name:
    PII Type: None
    Sensitivity: NON_SENSITIVE
  access_constraints:
    PII Type: None
    Sensitivity: NON_SENSITIVE
  group_vulnerability_map:
    PII Type: None
    Sensitivity: NON_SENSITIVE
  incident_reports:
    PII Type: None
    Sensitivity: NON_SENSITIVE

PII detection and reflection complete!
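
For wide tables, printing every column is noisy, so it can be useful to list only the columns that were flagged. The sketch below relies on the `pii_detection_<model>` and `pii_reflection_<model>` keys used in the cell above. The helper `flagged_columns`, the sample table, and the `EMAIL_ADDRESS` label are hypothetical, and using `MODERATE_SENSITIVE` as a PII reflection label is an assumption.

```python
def flagged_columns(table_data, model_name):
    """Return {column: (pii_type, sensitivity)} for columns not marked NON_SENSITIVE."""
    detection_key = f"pii_detection_{model_name}"
    reflection_key = f"pii_reflection_{model_name}"
    flagged = {}
    for col_name, col_data in table_data.get("columns", {}).items():
        sensitivity = col_data.get(reflection_key)
        # Skip columns that were never analyzed or were judged non-sensitive
        if sensitivity is None or sensitivity == "NON_SENSITIVE":
            continue
        flagged[col_name] = (col_data.get(detection_key), sensitivity)
    return flagged

# Hypothetical annotated table; the values are illustrative only
example = {
    "columns": {
        "report_date": {
            "pii_detection_gpt-4o-mini": "None",
            "pii_reflection_gpt-4o-mini": "NON_SENSITIVE",
        },
        "contact_email": {
            "pii_detection_gpt-4o-mini": "EMAIL_ADDRESS",
            "pii_reflection_gpt-4o-mini": "MODERATE_SENSITIVE",
        },
    }
}

print(flagged_columns(example, "gpt-4o-mini"))
```

For the `dummy` table above, where every column came back as `NON_SENSITIVE`, the helper would return an empty dictionary.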


### 3.2 Non-PII Sensitivity Detection


In [17]:
# Perform Non-PII sensitivity detection
print("Starting Non-PII sensitivity detection...")

for table_name, table_data in loaded_data.items():
    print(f"\nProcessing table: {table_name}")
    
    # Apply non-PII detection (table-level analysis, per method='table')
    table_data = detect_non_pii(table_data, sensitivity_classifier, table_name, method='table')
    loaded_data[table_name] = table_data  # keep the annotated table for later steps
    
    # Display results
    print("Non-PII Sensitivity Results:")
    print(table_data['metadata'][f'non_pii_{MODEL_NAME}'])
    print(table_data['metadata'][f'non_pii_{MODEL_NAME}_explanation'])
    
    # Check ISP used
    isp_used = table_data.get('metadata', {}).get('isp_used', 'Unknown')
    print(f"\n  ISP Context Used: {isp_used}")

print("\nNon-PII sensitivity detection complete!")


Starting Non-PII sensitivity detection...

Processing table: dummy
Non-PII Sensitivity Results:
MODERATE_SENSITIVE
- Sensitivity Classification: MODERATE_SENSITIVE
- Sensitive Columns: location, access_level, service_coverage, population_total, facility_type, activity_type, region_name, access_constraints, group_vulnerability_map, incident_reports
- Cited ISP Rule(s): The table contains aggregated survey results that are potentially sensitive at the district level, as indicated by the presence of location, population data, and vulnerability assessments. This aligns with the MODERATE_SENSITIVE classification under the ISP, specifically the rule: "Aggregated survey results (e.g. aggregated to the district level)." The data includes indicators of service coverage and group vulnerabilities, which could reveal sensitive information about specific populations in the districts.

  ISP Context Used: default

Non-PII sensitivity detection complete!
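
The explanation is free text produced by the LLM, but in the output above it contains a `Sensitive Columns:` line that can be turned into a list. The sketch below assumes that line format. It is not guaranteed by the model, so the hypothetical helper `extract_sensitive_columns` returns an empty list when the line is absent.

```python
def extract_sensitive_columns(explanation):
    """Return the column names listed on the 'Sensitive Columns:' line, if any."""
    marker = "Sensitive Columns:"
    for line in explanation.splitlines():
        if marker in line:
            listed = line.split(marker, 1)[1]
            return [name.strip() for name in listed.split(",") if name.strip()]
    return []

# Shortened copy of the explanation format shown in the output above
explanation = (
    "- Sensitivity Classification: MODERATE_SENSITIVE\n"
    "- Sensitive Columns: location, access_level, service_coverage\n"
)

print(extract_sensitive_columns(explanation))
```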


## 4. Save Results to a JSON File

In [27]:
from modules.detect_reflect.utils import save_json_data

save_json_data(loaded_data, '../data/dummy_results.json')
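
To inspect the saved results later, the file can be read back with the standard `json` module. This is a sketch that assumes `save_json_data` writes standard JSON; its implementation is not shown here. The helper `load_results` is hypothetical, and a temporary file with a small sample dictionary stands in for `dummy_results.json`.

```python
import json
import tempfile
from pathlib import Path

def load_results(path):
    """Read a JSON results file back into a dictionary."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Small stand-in for the annotated data structure
sample = {"dummy": {"columns": {}, "metadata": {"detected_language": "en"}}}

with tempfile.TemporaryDirectory() as tmp_dir:
    results_path = Path(tmp_dir) / "results.json"
    results_path.write_text(json.dumps(sample), encoding="utf-8")
    reloaded = load_results(results_path)

print(reloaded["dummy"]["metadata"]["detected_language"])
```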

## Summary

In this tutorial, we've demonstrated the complete workflow of the sensitive data detection system:

1. **Data Processing**: We loaded and processed `dummy.csv` using the `DataLoader` class, which structured the data into tables and columns and extracted metadata, including country information.

2. **Language Detection**: We detected the primary language of the data using the `LanguageDetector` class.

3. **PII and Non-PII Detection**: We used the `detect_reflect` module to:
   - Identify PII entities in each column
   - Reflect on the sensitivity level of identified PII
   - Detect non-PII sensitive information using ISP (Information Sharing Policy) context

4. **Free Text Analysis**: The `FreeTextDetector` can identify which columns contain free text data and optionally analyze them for PII content; the cells above do not run this step.

### Key Features:

- **Modular Design**: Each step is handled by a specialized module
- **API Key Management**: Uses environment variables for secure API key storage
- **Mock Mode**: Provides demonstration capabilities even without API keys
- **Comprehensive Analysis**: Covers both structured data analysis and free text detection
- **ISP Context**: Uses location-based Information Sharing Policies for context-aware analysis

### Next Steps:

1. Set up your OpenAI API key in the `.env` file for full functionality
2. Try the workflow with your own data files
3. Experiment with different ISP contexts for non-PII detection
4. Customize the detection parameters for your specific use case

The results from each step provide detailed insights for making informed decisions about data handling and privacy protection measures.


## Environment Setup Instructions

To get full functionality from this tutorial, create a `.env` file in the project root directory with the following content:

```bash
# API Keys for Sensitive Data Detection Tutorial
OPENAI_API_KEY=your_openai_api_key_here
HUGGINGFACE_API_KEY=your_huggingface_api_key_here

```

You can create this file by running the following command in your terminal from the project root:

```bash
cat > .env << 'EOF'
OPENAI_API_KEY=your_openai_api_key_here
HUGGINGFACE_API_KEY=your_huggingface_api_key_here
EOF
```

Then replace the placeholder values `your_openai_api_key_here` and `your_huggingface_api_key_here` with your actual API keys.
