# OpenToken PySpark Example

This notebook demonstrates how to use the OpenToken PySpark bridge to generate privacy-preserving tokens from a PySpark DataFrame.

## Prerequisites

1. Install the required packages:
   ```bash
   cd lib/python
   pip install -e .
   cd ../python-pyspark
   pip install -e .
   ```

2. Ensure you have PySpark and Jupyter installed.

## Setup

In [None]:
# Import required libraries
from pyspark.sql import SparkSession
from opentoken_pyspark import OpenTokenProcessor
import pandas as pd

In [None]:
# Create a Spark session
spark = SparkSession.builder \
    .appName("OpenTokenExample") \
    .master("local[*]") \
    .getOrCreate()

print(f"Spark version: {spark.version}")

## Load Sample Data

We'll load the sample CSV data into a PySpark DataFrame.

In [None]:
# Load sample data from CSV
sample_csv_path = "../../../../resources/sample.csv"

df = spark.read.csv(
    sample_csv_path,
    header=True,
    inferSchema=True
)

# Display the schema
print("Input DataFrame Schema:")
df.printSchema()

# Show first few rows
print("\nFirst 5 rows:")
df.show(5, truncate=False)
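
The relative path above resolves against the notebook's working directory, so a kernel started from a different folder will not find the file. A minimal sketch of an early check follows; `resolve_input_path` is a hypothetical helper for illustration and is not part of OpenToken or PySpark.

```python
from pathlib import Path
from typing import Optional


def resolve_input_path(relative_path: str, base_dir: Optional[Path] = None) -> str:
    """Return an absolute path for relative_path, or raise if the file is missing.

    Hypothetical helper: base_dir defaults to the current working directory,
    which is what a relative path resolves against.
    """
    base = base_dir if base_dir is not None else Path.cwd()
    resolved = (base / relative_path).resolve()
    if not resolved.is_file():
        raise FileNotFoundError(f"Sample CSV not found at {resolved}")
    return str(resolved)
```

To use it, set `sample_csv_path = resolve_input_path("../../../../resources/sample.csv")` before calling `spark.read.csv`.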

## Initialize OpenToken Processor

Create an instance of the OpenTokenProcessor with your hashing secret and encryption key.

**Note:** The secrets used here are for demonstration purposes only. In production, load them from a secrets manager or from environment variables instead of hard-coding them in the notebook.

In [None]:
# Initialize the processor with secrets
processor = OpenTokenProcessor(
    hashing_secret="HashingKey",
    encryption_key="Secret-Encryption-Key-Goes-Here."
)

print("OpenToken Processor initialized successfully!")
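
One way to keep the secrets out of the notebook is to read them from environment variables. The sketch below assumes the variable names `OPENTOKEN_HASHING_SECRET` and `OPENTOKEN_ENCRYPTION_KEY`, which are placeholders chosen for this example and not names defined by OpenToken; `load_secret` is likewise a hypothetical helper.

```python
import os


def load_secret(name: str) -> str:
    """Read a required secret from the environment, failing loudly if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# The variable names below are assumptions for this sketch; use whatever
# your deployment or secrets manager exports.
# processor = OpenTokenProcessor(
#     hashing_secret=load_secret("OPENTOKEN_HASHING_SECRET"),
#     encryption_key=load_secret("OPENTOKEN_ENCRYPTION_KEY"),
# )
```

Failing at startup when a variable is missing avoids generating tokens with an empty or default secret.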

## Generate Tokens

Process the DataFrame to generate tokens for each person record.

In [None]:
# Generate tokens
tokens_df = processor.process_dataframe(df)

# Display the schema of the result
print("Output DataFrame Schema:")
tokens_df.printSchema()

# Count total tokens generated
total_tokens = tokens_df.count()
print(f"\nTotal tokens generated: {total_tokens}")

## Inspect Results

Let's look at the generated tokens for a specific record.

In [None]:
# Show tokens for the first record
first_record_id = df.select("RecordId").first()[0]
print(f"Tokens for RecordId: {first_record_id}")

tokens_df.filter(tokens_df.RecordId == first_record_id).show(truncate=False)

## Analyze Token Distribution

Check how many tokens were generated per rule.

In [None]:
# Count tokens by RuleId
print("Token count by RuleId:")
tokens_df.groupBy("RuleId").count().orderBy("RuleId").show()
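
A complementary check is to count tokens per record. The sketch below assumes that every record is expected to yield the same number of tokens, which this notebook does not confirm, and flags records whose count differs from the most common one. `token_count_outliers` is a hypothetical helper that works on plain Python values, so it suits small samples collected to the driver.

```python
from collections import Counter
from typing import Dict, Iterable


def token_count_outliers(record_ids: Iterable[str]) -> Dict[str, int]:
    """Map record ID to token count for records that differ from the most common count.

    record_ids holds one entry per token row, so an ID repeats once per token.
    """
    per_record = Counter(record_ids)
    if not per_record:
        return {}
    # The most frequent per-record count is treated as the expected one (an assumption).
    typical = Counter(per_record.values()).most_common(1)[0][0]
    return {rid: n for rid, n in per_record.items() if n != typical}


# Usage with the DataFrame from this notebook (collects to the driver):
# ids = [row["RecordId"] for row in tokens_df.select("RecordId").collect()]
# print(token_count_outliers(ids))
```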

## Convert to Pandas for Visualization

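For small results, you can convert to a pandas DataFrame for easier inspection. `toPandas()` collects all rows to the driver, so apply `limit()` first, as the cell below does.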

In [None]:
# Convert a subset to Pandas for visualization
sample_tokens = tokens_df.limit(10).toPandas()
print("Sample tokens as Pandas DataFrame:")
display(sample_tokens)

## Save Results

Save the tokens to a Parquet file for further processing.

In [None]:
# Save to Parquet
output_path = "../output/tokens_output.parquet"

tokens_df.write.mode("overwrite").parquet(output_path)
print(f"Tokens saved to: {output_path}")

## Example: Using Alternative Column Names

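OpenToken accepts alternative names for its input columns. The record below uses `Id` in place of `RecordId`, along with names such as `GivenName`, `Surname`, and `ZipCode`, and is processed without renaming any columns.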

In [None]:
# Create a DataFrame with alternative column names
alt_data = [
    {
        "Id": "custom-001",
        "GivenName": "Alice",
        "Surname": "Johnson",
        "ZipCode": "98052",
        "Gender": "Female",
        "DateOfBirth": "1990-05-15",
        "NationalIdentificationNumber": "234-56-7890"
    }
]

alt_df = spark.createDataFrame(alt_data)

# Process with alternative column names
alt_tokens_df = processor.process_dataframe(alt_df)

print("Tokens generated with alternative column names:")
alt_tokens_df.show(truncate=False)

## Performance Considerations

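PySpark processes a DataFrame's partitions in parallel. With the `local[*]` master configured above, that parallelism is across local CPU cores; on a cluster, partitions are distributed across executors. The cell below reports how many partitions the input and output DataFrames have.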

In [None]:
# Check the number of partitions
print(f"Number of partitions in input DataFrame: {df.rdd.getNumPartitions()}")
print(f"Number of partitions in output DataFrame: {tokens_df.rdd.getNumPartitions()}")
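
A small input file often loads into a single partition, which leaves little to run in parallel. The sketch below raises the partition count before processing using `DataFrame.repartition`. `repartition_if_needed` is a hypothetical helper, and whether `process_dataframe` preserves the input partitioning is an assumption you can verify with the cell above.

```python
def repartition_if_needed(df, min_partitions: int):
    """Return df repartitioned to min_partitions when it currently has fewer.

    df is expected to be a PySpark DataFrame; it is returned unchanged when
    it already has at least min_partitions partitions.
    """
    current = df.rdd.getNumPartitions()
    if current < min_partitions:
        # repartition() triggers a shuffle, so only do it when it helps.
        return df.repartition(min_partitions)
    return df


# Usage (the target of 8 is an arbitrary example value):
# tokens_df = processor.process_dataframe(repartition_if_needed(df, 8))
```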

## Cleanup

Stop the Spark session when done.

In [None]:
# Stop Spark session
# spark.stop()
print("Session complete. Uncomment the line above to stop Spark.")

## Summary

This notebook demonstrated:

1. Loading person data into a PySpark DataFrame
2. Initializing the OpenToken processor with secrets
3. Generating privacy-preserving tokens for each record
4. Analyzing and visualizing the results
5. Saving tokens for further use

The PySpark bridge enables distributed token generation for large-scale person matching workflows.