# Dataset Overlap Analysis Guide

This notebook demonstrates how to use the `OpenTokenOverlapAnalyzer` class to identify matching records between two tokenized datasets.

## Overview

The `OpenTokenOverlapAnalyzer` helps you:
- Find matching records across two datasets based on encrypted tokens
- Define flexible matching rules (which token types must match)
- Compare overlap rates using different matching criteria
- Get detailed statistics and matched record pairs

## Setup

In [None]:
from pyspark.sql import SparkSession
from opentoken_pyspark import OpenTokenProcessor, OpenTokenOverlapAnalyzer

# Create Spark session
spark = SparkSession.builder \
    .appName("OverlapAnalysisExample") \
    .master("local[*]") \
    .getOrCreate()

# Set secrets (use the same secrets for both datasets!)
# Demo values only; never hard-code real secrets in a shared notebook.
HASHING_SECRET = "my-hashing-secret-key"
ENCRYPTION_KEY = "my-encryption-key-32-characters!"  # Must be 32 characters
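Outside a demo, the two secrets would normally come from configuration rather than from literals in the notebook. The sketch below shows one way to do that with environment variables, including the 32-character check mentioned above. The function name `load_secrets` and the variable names `OPENTOKEN_HASHING_SECRET` and `OPENTOKEN_ENCRYPTION_KEY` are hypothetical and are not defined by the library.

```python
import os


def load_secrets(env=os.environ):
    """Read both secrets from a mapping (the process environment by default).

    The variable names are hypothetical; use whatever your deployment defines.
    """
    hashing_secret = env.get("OPENTOKEN_HASHING_SECRET")
    encryption_key = env.get("OPENTOKEN_ENCRYPTION_KEY")

    if not hashing_secret:
        raise ValueError("hashing secret is missing or empty")
    if encryption_key is None or len(encryption_key) != 32:
        raise ValueError("encryption key must be exactly 32 characters")
    return hashing_secret, encryption_key


# Demonstration with an explicit mapping instead of the real environment
demo_env = {
    "OPENTOKEN_HASHING_SECRET": "my-hashing-secret-key",
    "OPENTOKEN_ENCRYPTION_KEY": "my-encryption-key-32-characters!",
}
demo_hashing_secret, demo_encryption_key = load_secrets(demo_env)
print(len(demo_encryption_key))  # 32
```

Failing early on a wrong key length keeps the error close to its cause instead of surfacing later during token generation.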

## Create Sample Datasets

Let's create two datasets representing patient records from different hospitals.

In [None]:
# Hospital A data
hospital_a_data = [
    ("A001", "John", "Doe", "1990-01-15", "Male", "98101", "123-45-6789"),
    ("A002", "Jane", "Smith", "1985-06-20", "Female", "94105", "987-65-4321"),
    ("A003", "Bob", "Johnson", "1978-03-10", "Male", "02134", "456-78-9123"),
    ("A004", "Alice", "Williams", "1992-11-05", "Female", "10001", "321-54-9876"),
    ("A005", "Charlie", "Brown", "1988-08-22", "Male", "60614", "789-12-3456"),
]

hospital_a_df = spark.createDataFrame(
    hospital_a_data,
    ["RecordId", "FirstName", "LastName", "BirthDate", "Sex", "PostalCode", "SocialSecurityNumber"]
)

print("Hospital A Records:")
hospital_a_df.show()

In [None]:
# Hospital B data (has some overlapping patients)
hospital_b_data = [
    ("B101", "John", "Doe", "1990-01-15", "Male", "98101", "123-45-6789"),     # Same as A001
    ("B102", "Jane", "Smith", "1985-06-20", "Female", "94105", "987-65-4321"),  # Same as A002
    ("B103", "David", "Lee", "1995-04-18", "Male", "30303", "654-32-1098"),     # Unique to B
    ("B104", "Emma", "Davis", "1982-12-30", "Female", "90210", "234-56-7890"),  # Unique to B
]

hospital_b_df = spark.createDataFrame(
    hospital_b_data,
    ["RecordId", "FirstName", "LastName", "BirthDate", "Sex", "PostalCode", "SocialSecurityNumber"]
)

print("Hospital B Records:")
hospital_b_df.show()

## Generate Tokens for Both Datasets

In [None]:
# Initialize token processor
processor = OpenTokenProcessor(HASHING_SECRET, ENCRYPTION_KEY)

# Generate tokens for Hospital A
hospital_a_tokens = processor.process_dataframe(hospital_a_df)
print("Hospital A Tokens:")
hospital_a_tokens.show(5, truncate=False)

In [None]:
# Generate tokens for Hospital B
hospital_b_tokens = processor.process_dataframe(hospital_b_df)
print("Hospital B Tokens:")
hospital_b_tokens.show(5, truncate=False)

## Analyze Dataset Overlap

### Method 1: Single Rule Set

Let's find records whose T1 and T2 tokens both match.

In [None]:
# Initialize overlap analyzer with the same encryption key
analyzer = OpenTokenOverlapAnalyzer(ENCRYPTION_KEY)

# Analyze overlap with T1 and T2 matching rules
results = analyzer.analyze_overlap(
    hospital_a_tokens,
    hospital_b_tokens,
    matching_rules=["T1", "T2"],
    dataset1_name="Hospital_A",
    dataset2_name="Hospital_B"
)

# Print summary
analyzer.print_summary(results)
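To make the "all listed rules must match" semantics concrete, here is a minimal plain-Python sketch of the same idea on toy data. It is an illustration only, not the analyzer's implementation: the helper `find_matches` and the token values are made up, and the real analyzer works on Spark DataFrames.

```python
def find_matches(tokens_a, tokens_b, matching_rules):
    """Pair records whose tokens agree on every rule in matching_rules.

    tokens_a and tokens_b map a record id to {rule id: token value}.
    """
    def key(tokens):
        # One combined key per record: the token values for the required rules
        return tuple(tokens.get(rule) for rule in matching_rules)

    # Index dataset B by combined key, skipping records missing a required token
    index = {}
    for record_id, tokens in tokens_b.items():
        k = key(tokens)
        if None not in k:
            index.setdefault(k, []).append(record_id)

    pairs = []
    for record_id, tokens in tokens_a.items():
        k = key(tokens)
        if None in k:
            continue
        for other_id in index.get(k, []):
            pairs.append((record_id, other_id))
    return sorted(pairs)


# Toy token values (made up for illustration)
toy_a = {
    "A001": {"T1": "x1", "T2": "y1"},
    "A002": {"T1": "x2", "T2": "y2"},
    "A003": {"T1": "x3", "T2": "y3"},
}
toy_b = {
    "B101": {"T1": "x1", "T2": "y1"},
    "B102": {"T1": "x2", "T2": "different"},
}

print(find_matches(toy_a, toy_b, ["T1"]))        # [('A001', 'B101'), ('A002', 'B102')]
print(find_matches(toy_a, toy_b, ["T1", "T2"]))  # [('A001', 'B101')]
```

A002 and B102 agree on T1 only, so the pair drops out once T2 is also required.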

### View Matched Record Pairs

In [None]:
# Show matched record pairs
print("Matched Record Pairs:")
results['matches'].show()

# Expected: Should show A001<->B101 and A002<->B102
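The expected pairs can be cross-checked directly against the sample rows, without any tokens involved. The sketch below repeats the sample data from this notebook and pairs rows whose fields are identical apart from `RecordId`. It is only a sanity check for the demo data, and the helper name `identity` is introduced here for illustration.

```python
hospital_a_data = [
    ("A001", "John", "Doe", "1990-01-15", "Male", "98101", "123-45-6789"),
    ("A002", "Jane", "Smith", "1985-06-20", "Female", "94105", "987-65-4321"),
    ("A003", "Bob", "Johnson", "1978-03-10", "Male", "02134", "456-78-9123"),
    ("A004", "Alice", "Williams", "1992-11-05", "Female", "10001", "321-54-9876"),
    ("A005", "Charlie", "Brown", "1988-08-22", "Male", "60614", "789-12-3456"),
]
hospital_b_data = [
    ("B101", "John", "Doe", "1990-01-15", "Male", "98101", "123-45-6789"),
    ("B102", "Jane", "Smith", "1985-06-20", "Female", "94105", "987-65-4321"),
    ("B103", "David", "Lee", "1995-04-18", "Male", "30303", "654-32-1098"),
    ("B104", "Emma", "Davis", "1982-12-30", "Female", "90210", "234-56-7890"),
]


def identity(row):
    """Everything except the RecordId in the first position."""
    return row[1:]


# Map each Hospital B identity to its record id, then look up Hospital A rows
b_by_identity = {identity(row): row[0] for row in hospital_b_data}
expected_pairs = [
    (row[0], b_by_identity[identity(row)])
    for row in hospital_a_data
    if identity(row) in b_by_identity
]
print(expected_pairs)  # [('A001', 'B101'), ('A002', 'B102')]
```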

In [None]:
# Debug: check that token decryption works in the analyzer.
# Note: _decrypt_token is a private method, so treat this as a diagnostic only.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Decrypt one T1 token manually, reusing the analyzer created above.
# Filter before select so the RuleId column is still available.
test_row = hospital_a_tokens.filter(col("RuleId") == "T1").select("Token").first()
if test_row is None:
    print("No T1 tokens found in hospital_a_tokens")
else:
    test_token = test_row["Token"]
    print(f"Sample Token: {test_token}")
    print(f"Decrypted: {analyzer._decrypt_token(test_token)}")

# Check that the same method works when wrapped as a Spark UDF
decrypt_udf = udf(analyzer._decrypt_token, StringType())
(
    hospital_a_tokens
    .withColumn("Decrypted", decrypt_udf(col("Token")))
    .select("RuleId", "Token", "Decrypted")
    .show(5, truncate=False)
)

### Method 2: Compare Multiple Rule Sets

Let's see how overlap changes with different matching criteria.

In [None]:
# Define different rule sets to compare
rule_sets = [
    ["T1"],              # Least strict: only T1 must match
    ["T1", "T2"],        # Medium: T1 AND T2 must match
    ["T1", "T2", "T3"],  # Strict: T1 AND T2 AND T3 must match
    ["T1", "T2", "T3", "T4"],  # Very strict: all 4 tokens must match
]

# Run comparison
multi_results = analyzer.compare_with_multiple_rules(
    hospital_a_tokens,
    hospital_b_tokens,
    rule_sets,
    dataset1_name="Hospital_A",
    dataset2_name="Hospital_B"
)

# Display comparison
print("\nOverlap Comparison Across Different Matching Rules:")
print("=" * 70)
for result in multi_results:
    rules_str = ", ".join(result['matching_rules'])
    print(f"Rules: {rules_str:20} | "
          f"Matches: {result['matching_records_dataset1']:2} | "
          f"Overlap: {result['overlap_percentage']:5.1f}%")
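When each rule set extends the previous one, as in the list above, the match count can only stay the same or shrink, because a record that matches on T1, T2, and T3 necessarily matches on T1 and T2. The plain-Python sketch below illustrates this on made-up token values. The helper `matching_ids` is hypothetical and is not how the analyzer is implemented.

```python
def matching_ids(tokens_a, tokens_b, rules):
    """Ids from tokens_a that agree with some record in tokens_b on every rule."""
    keys_b = {tuple(tokens[r] for r in rules) for tokens in tokens_b.values()}
    return {
        record_id
        for record_id, tokens in tokens_a.items()
        if tuple(tokens[r] for r in rules) in keys_b
    }


# Toy token values (made up for illustration)
toy_tokens_a = {
    "A001": {"T1": "a", "T2": "b", "T3": "c"},
    "A002": {"T1": "d", "T2": "e", "T3": "f"},
    "A003": {"T1": "g", "T2": "h", "T3": "i"},
}
toy_tokens_b = {
    "B101": {"T1": "a", "T2": "b", "T3": "c"},    # agrees with A001 on everything
    "B102": {"T1": "d", "T2": "e", "T3": "zzz"},  # agrees with A002 on T1 and T2
    "B103": {"T1": "g", "T2": "yyy", "T3": "i"},  # agrees with A003 on T1 and T3
}

nested_rule_sets = [["T1"], ["T1", "T2"], ["T1", "T2", "T3"]]
counts = [
    len(matching_ids(toy_tokens_a, toy_tokens_b, rules))
    for rules in nested_rule_sets
]
print(counts)  # [3, 2, 1]
```

A sharp drop between two adjacent rule sets points at the token type that was just added as the one on which the datasets disagree.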

### Access Detailed Statistics

In [None]:
# Access detailed statistics from any result
result = multi_results[1]  # T1 + T2 result

print(f"Dataset: {result['dataset1_name']}")
print(f"  Total records: {result['total_records_dataset1']}")
print(f"  Matching records: {result['matching_records_dataset1']}")
print(f"  Unique records: {result['unique_to_dataset1']}")
print()
print(f"Dataset: {result['dataset2_name']}")
print(f"  Total records: {result['total_records_dataset2']}")
print(f"  Matching records: {result['matching_records_dataset2']}")
print(f"  Unique records: {result['unique_to_dataset2']}")
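The sketch below shows how these fields would relate to one another under two assumptions that this notebook does not confirm: that the unique count is the total minus the matching count, and that the overlap percentage is measured relative to the first dataset. The function `summarize_overlap` is hypothetical, so check the analyzer's own output before relying on either assumption.

```python
def summarize_overlap(total_1, matching_1, total_2, matching_2):
    """Derive unique counts and an overlap percentage from raw counts.

    Assumptions (not confirmed by the library): unique = total - matching,
    and the percentage is relative to dataset 1.
    """
    overlap_pct = 100.0 * matching_1 / total_1 if total_1 else 0.0
    return {
        "unique_to_dataset1": total_1 - matching_1,
        "unique_to_dataset2": total_2 - matching_2,
        "overlap_percentage": overlap_pct,
    }


# Counts for the sample data: 5 and 4 records, with 2 matched in each
stats = summarize_overlap(total_1=5, matching_1=2, total_2=4, matching_2=2)
print(stats)
# {'unique_to_dataset1': 3, 'unique_to_dataset2': 2, 'overlap_percentage': 40.0}
```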

## Use Case: Analyzing Data Sharing Potential

Let's say you want to assess whether two institutions should establish a data sharing agreement.

In [None]:
# Analyze with strict matching criteria
strict_results = analyzer.analyze_overlap(
    hospital_a_tokens,
    hospital_b_tokens,
    matching_rules=["T1", "T2", "T3"],  # Require 3 token types to match
    dataset1_name="Hospital_A",
    dataset2_name="Hospital_B"
)

# Decision logic (the 10% and 30% thresholds are illustrative; adjust them to your own criteria)
overlap_pct = strict_results['overlap_percentage']
unique_a = strict_results['unique_to_dataset1']
unique_b = strict_results['unique_to_dataset2']

print("\n" + "=" * 70)
print("DATA SHARING ASSESSMENT")
print("=" * 70)
print(f"Overlap: {overlap_pct:.1f}%")
print(f"Unique records in Hospital A: {unique_a}")
print(f"Unique records in Hospital B: {unique_b}")
print()

if overlap_pct < 10:
    print("✓ RECOMMENDATION: High potential for data sharing")
    print("  Minimal overlap suggests complementary patient populations.")
elif overlap_pct < 30:
    print("✓ RECOMMENDATION: Moderate potential for data sharing")
    print("  Some overlap exists but substantial unique populations in each dataset.")
else:
    print("⚠ RECOMMENDATION: Review data sharing need carefully")
    print("  Significant overlap may reduce value of data sharing.")
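The same thresholds can be wrapped in a small function so that the cut-offs are explicit parameters and the logic is easy to test. This is a sketch of the decision rule above. The function name `assess_sharing_potential` and the returned labels are made up for illustration.

```python
def assess_sharing_potential(overlap_pct, low=10.0, high=30.0):
    """Classify an overlap percentage using the notebook's two thresholds."""
    if not 0.0 <= overlap_pct <= 100.0:
        raise ValueError("overlap_pct must be between 0 and 100")
    if overlap_pct < low:
        return "high"      # minimal overlap, complementary populations
    if overlap_pct < high:
        return "moderate"  # some overlap, still substantial unique records
    return "review"        # significant overlap, sharing may add little


for pct in (5.0, 10.0, 29.9, 30.0):
    print(pct, assess_sharing_potential(pct))
```

Both boundaries fall into the stricter category, matching the `<` comparisons in the cell above: exactly 10% counts as moderate and exactly 30% as review.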

## Cleanup

In [None]:
# Stop Spark session
spark.stop()

## Key Takeaways

1. **Encryption Key**: The analyzer must use the same encryption key that generated the tokens, and both datasets must be tokenized with the same secrets
2. **Matching Rules**: Define which token types must match (ALL listed types must match, not just one)
3. **Multiple Criteria**: Use `compare_with_multiple_rules()` to see how overlap changes with stricter matching
4. **Privacy**: The analysis works on encrypted tokens rather than raw identifiers, so no PHI is exposed
5. **Results**: Results include both summary statistics and the matched record pairs

## Next Steps

- Try with your own datasets
- Experiment with different token rule combinations
- Use custom token definitions (see `Custom_Token_Definition_Guide.ipynb`)
- Scale to larger datasets using your Spark cluster