# Custom Token Definition Guide

This notebook demonstrates how to create custom token definitions with minimal code using the OpenToken notebook helpers.

## Setup

First, import the necessary modules:

In [None]:
from opentoken.notebook_helpers import (
    TokenBuilder,
    CustomTokenDefinition,
    create_token_generator,
    quick_token,
    list_attributes,
    expression_help
)
from opentoken.attributes.general.record_id_attribute import RecordIdAttribute
from opentoken.attributes.person.first_name_attribute import FirstNameAttribute
from opentoken.attributes.person.last_name_attribute import LastNameAttribute
from opentoken.attributes.person.birth_date_attribute import BirthDateAttribute
from opentoken.attributes.person.sex_attribute import SexAttribute
from opentoken.attributes.person.postal_code_attribute import PostalCodeAttribute

## View Available Attributes

Check what attributes are available:

In [None]:
attrs = list_attributes()
print("Available attributes:")
for name in attrs.keys():
    print(f"  - {name}")

## Expression Syntax Help

Get help on expression syntax:

In [None]:
print(expression_help())
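
To build intuition for the chained expressions used in the rest of this notebook (for example `T|S(0,3)|U`), here is a minimal plain-Python sketch. It assumes `T` trims whitespace, `U` upper-cases, `S(start,end)` takes a substring, and `D` normalizes a date to `yyyy-MM-dd`; these meanings are inferred from the rules shown in this notebook, so treat the output of `expression_help()` as authoritative. The function `apply_expression` is a hypothetical helper and is not part of OpenToken.

```python
from datetime import datetime


def apply_expression(value: str, expression: str) -> str:
    """Apply a pipe-separated chain such as "T|S(0,3)|U" (illustrative only)."""
    for step in expression.split("|"):
        if step == "T":
            # Assumed meaning: trim surrounding whitespace
            value = value.strip()
        elif step == "U":
            # Assumed meaning: convert to upper case
            value = value.upper()
        elif step == "D":
            # Assumed meaning: parse a date and emit it as yyyy-MM-dd
            value = datetime.strptime(value, "%Y-%m-%d").strftime("%Y-%m-%d")
        elif step.startswith("S(") and step.endswith(")"):
            # Assumed meaning: substring from start (inclusive) to end (exclusive)
            start, end = (int(part) for part in step[2:-1].split(","))
            value = value[start:end]
        else:
            raise ValueError(f"Unsupported step: {step!r}")
    return value


print(apply_expression("  Doe ", "T|U"))        # DOE
print(apply_expression("98101", "T|S(0,3)"))    # 981
print(apply_expression("smith", "T|S(0,3)|U"))  # SMI
```

Steps are applied left to right, so `T|S(0,3)|U` trims first, then truncates, then upper-cases.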

## Method 1: Quick Token (Simplest)

Create a custom T6 token generator in a single call, using the rule:
`U(last-name)|U(first-name)|birth-date|postal-code-3|U(sex)`

In [None]:
# Create T6 token generator in one call
generator = quick_token(
    "T6",
    [
        ("last_name", "T|U"),
        ("first_name", "T|U"),
        ("birth_date", "T|D"),
        ("postal_code", "T|S(0,3)"),
        ("sex", "T|U")
    ],
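    "my-hashing-secret",  # hashing secret (placeholder)
    # Encryption key (placeholder). Note that this literal is 31 characters
    # long, so replace it with a real key of the required length.
    "my-32-character-encryption-key!"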
)

# Test it with sample data
person_attrs = {
    RecordIdAttribute: "1",
    FirstNameAttribute: "John",
    LastNameAttribute: "Doe",
    BirthDateAttribute: "1990-01-15",
    SexAttribute: "Male",
    PostalCodeAttribute: "98101"
}

result = generator.get_all_tokens(person_attrs)
print(f"T6 Token: {result.tokens.get('T6')}")
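
The printed token is not human-readable, so it can help to see what the T6 rule selects before any hashing or encryption takes place. The sketch below is a plain-Python approximation, not OpenToken's implementation: it assumes the rule parts are joined with `|` as the rule string above suggests, and the name `t6_signature` is purely illustrative.

```python
# Plain-Python approximation of the T6 rule applied to the sample person.
sample = {
    "last_name": "Doe",
    "first_name": "John",
    "birth_date": "1990-01-15",
    "postal_code": "98101",
    "sex": "Male",
}

t6_parts = [
    sample["last_name"].strip().upper(),   # T|U
    sample["first_name"].strip().upper(),  # T|U
    sample["birth_date"].strip(),          # T|D (already in yyyy-MM-dd form)
    sample["postal_code"].strip()[0:3],    # T|S(0,3)
    sample["sex"].strip().upper(),         # T|U
]

t6_signature = "|".join(t6_parts)
print(t6_signature)  # DOE|JOHN|1990-01-15|981|MALE
```

The value returned by `result.tokens.get('T6')` will not look like this string, since the generator is configured with a hashing secret and an encryption key.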

## Method 2: Token Builder (More Flexible)

Use the fluent TokenBuilder API for more control:

In [None]:
# Create a custom T6 token
t6_token = TokenBuilder("T6") \
    .add("last_name", "T|U") \
    .add("first_name", "T|U") \
    .add("birth_date", "T|D") \
    .add("postal_code", "T|S(0,3)") \
    .add("sex", "T|U") \
    .build()

# Create a custom token definition
custom_definition = CustomTokenDefinition().add_token(t6_token)

# Create token generator
generator = create_token_generator(
    "my-hashing-secret",
    "my-32-character-encryption-key!",
    custom_definition
)

# Generate tokens
result = generator.get_all_tokens(person_attrs)
print(f"T6 Token: {result.tokens.get('T6')}")
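
A fluent API chains calls because each `add()` returns the builder itself. The sketch below shows the pattern with a hypothetical stand-in class, `SketchTokenBuilder`; it is not OpenToken's `TokenBuilder`, and the dictionary returned by `build()` is an assumption made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SketchTokenBuilder:
    """Hypothetical stand-in that shows how a fluent builder chains calls."""

    token_id: str
    rules: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, attribute: str, expression: str) -> "SketchTokenBuilder":
        # Record the rule, then return self so the next call can be chained
        self.rules.append((attribute, expression))
        return self

    def build(self) -> Dict[str, object]:
        # Return a plain snapshot of the collected rules
        return {"id": self.token_id, "rules": list(self.rules)}


# Wrapping the chain in parentheses avoids backslash line continuations
sketch = (
    SketchTokenBuilder("T6")
    .add("last_name", "T|U")
    .add("postal_code", "T|S(0,3)")
    .build()
)
print(sketch)
```

The parenthesized form is ordinary Python syntax, so it should work equally well for chains on the real `TokenBuilder`.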

## Method 3: Multiple Custom Tokens

Define multiple tokens in one definition:

In [None]:
# Create T6 token: U(last-name)|U(first-name)|birth-date|postal-code-3|U(sex)
t6_token = TokenBuilder("T6") \
    .add("last_name", "T|U") \
    .add("first_name", "T|U") \
    .add("birth_date", "T|D") \
    .add("postal_code", "T|S(0,3)") \
    .add("sex", "T|U") \
    .build()

# Create T7 token: U(last-name-3)|U(first-name-3)|birth-date
t7_token = TokenBuilder("T7") \
    .add("last_name", "T|S(0,3)|U") \
    .add("first_name", "T|S(0,3)|U") \
    .add("birth_date", "T|D") \
    .build()

# Add both tokens to definition
custom_definition = CustomTokenDefinition() \
    .add_token(t6_token) \
    .add_token(t7_token)

# Create generator
generator = create_token_generator(
    "my-hashing-secret",
    "my-32-character-encryption-key!",
    custom_definition
)

# Generate both tokens
result = generator.get_all_tokens(person_attrs)
print(f"T6 Token: {result.tokens.get('T6')}")
print(f"T7 Token: {result.tokens.get('T7')}")
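
The two rules differ in how strict they are: T6 uses full names plus postal code and sex, while T7 uses only three-letter name prefixes and the birth date. The following plain-Python sketch approximates both rules to show the effect on two variants of the same person. The helper names are hypothetical, and the `|`-joined strings stand in for whatever OpenToken builds internally before hashing.

```python
def t6_signature(person: dict) -> str:
    """Approximate U(last-name)|U(first-name)|birth-date|postal-code-3|U(sex)."""
    return "|".join([
        person["last_name"].strip().upper(),
        person["first_name"].strip().upper(),
        person["birth_date"].strip(),
        person["postal_code"].strip()[:3],
        person["sex"].strip().upper(),
    ])


def t7_signature(person: dict) -> str:
    """Approximate U(last-name-3)|U(first-name-3)|birth-date."""
    return "|".join([
        person["last_name"].strip()[:3].upper(),
        person["first_name"].strip()[:3].upper(),
        person["birth_date"].strip(),
    ])


original = {
    "last_name": "Doe",
    "first_name": "John",
    "birth_date": "1990-01-15",
    "postal_code": "98101",
    "sex": "Male",
}
# Same person recorded with a longer first name and a new postal code
variant = dict(original, first_name="Johnny", postal_code="10001")

print(t6_signature(original) == t6_signature(variant))  # False
print(t7_signature(original) == t7_signature(variant))  # True
```

Under these assumptions, the looser T7 rule still links the two records while T6 does not, which is one reason to generate several tokens per record.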

## Using Custom Tokens with PySpark DataFrames

Pass your custom token definition to `OpenTokenProcessor` through its `token_definition` argument:

In [None]:
from pyspark.sql import SparkSession
from opentoken_pyspark import OpenTokenProcessor
from opentoken.notebook_helpers import TokenBuilder, CustomTokenDefinition

# Create Spark session
spark = SparkSession.builder.appName("CustomTokens").getOrCreate()

# Create sample DataFrame
data = [
    ("1", "John", "Doe", "1990-01-15", "Male", "98101"),
    ("2", "Jane", "Smith", "1985-06-20", "Female", "94105")
]
df = spark.createDataFrame(data, ["RecordId", "FirstName", "LastName", "BirthDate", "Sex", "PostalCode"])

# Step 1: Define your custom T6 token
t6_token = TokenBuilder("T6") \
    .add("last_name", "T|U") \
    .add("first_name", "T|U") \
    .add("birth_date", "T|D") \
    .add("postal_code", "T|S(0,3)") \
    .add("sex", "T|U") \
    .build()

# Step 2: Create custom token definition
custom_definition = CustomTokenDefinition().add_token(t6_token)

# Step 3: Create processor with custom definition
processor = OpenTokenProcessor(
    hashing_secret="my-hashing-secret",
    encryption_key="my-32-character-encryption-key!",
    token_definition=custom_definition  # Pass custom definition here!
)

# Step 4: Process DataFrame with custom tokens
tokens_df = processor.process_dataframe(df)

# Show results - you'll see T6 tokens instead of T1-T5!
print("Custom T6 tokens generated:")
tokens_df.show(truncate=False)
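
Before processing a large DataFrame, it can be useful to confirm that every attribute a token rule refers to has a matching column. The sketch below is a plain-Python pre-flight check; it assumes that a snake_case attribute name such as `postal_code` corresponds to a PascalCase column such as `PostalCode`, as in the sample data above. That mapping is inferred from this notebook and is not a documented OpenToken rule, and both helper functions are hypothetical.

```python
def to_column_name(attribute: str) -> str:
    """Convert snake_case to PascalCase, e.g. "postal_code" -> "PostalCode"."""
    return "".join(part.capitalize() for part in attribute.split("_"))


def missing_columns(attributes, columns):
    """Return the attributes that have no matching column."""
    available = set(columns)
    return [a for a in attributes if to_column_name(a) not in available]


# In the notebook, this list would come from df.columns
columns = ["RecordId", "FirstName", "LastName", "BirthDate", "Sex", "PostalCode"]
t6_attributes = ["last_name", "first_name", "birth_date", "postal_code", "sex"]

print(missing_columns(t6_attributes, columns))            # []
print(missing_columns(t6_attributes + ["ssn"], columns))  # ['ssn']
```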

### Multiple Custom Tokens with PySpark

You can also use multiple custom tokens:

In [None]:
# Define multiple custom tokens
t6_token = TokenBuilder("T6") \
    .add("last_name", "T|U") \
    .add("first_name", "T|U") \
    .add("birth_date", "T|D") \
    .add("postal_code", "T|S(0,3)") \
    .add("sex", "T|U") \
    .build()

t7_token = TokenBuilder("T7") \
    .add("last_name", "T|S(0,3)|U") \
    .add("first_name", "T|S(0,3)|U") \
    .add("birth_date", "T|D") \
    .build()

# Add both to definition
multi_definition = CustomTokenDefinition() \
    .add_token(t6_token) \
    .add_token(t7_token)

# Create processor with multiple custom tokens
processor_multi = OpenTokenProcessor(
    hashing_secret="my-hashing-secret",
    encryption_key="my-32-character-encryption-key!",
    token_definition=multi_definition
)

# Process - will generate both T6 and T7 tokens
multi_tokens_df = processor_multi.process_dataframe(df)

print("Multiple custom tokens (T6 and T7):")
multi_tokens_df.show(truncate=False)

## Experiment with Different Rules

Try different combinations of attributes and expressions, for example a deliberately loose token next to a strict one:

In [None]:
# Minimal token: just last name and first initial
minimal_token = TokenBuilder("MINIMAL") \
    .add("last_name", "T|U") \
    .add("first_name", "T|S(0,1)|U") \
    .build()

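# Full token: everything, including SSN.
# Note: person_attrs defined earlier has no SSN value, so add one
# before relying on the FULL token.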
full_token = TokenBuilder("FULL") \
    .add("last_name", "T|U") \
    .add("first_name", "T|U") \
    .add("birth_date", "T|D") \
    .add("sex", "T|U") \
    .add("postal_code", "T|S(0,5)") \
    .add("ssn", "T") \
    .build()

# Create definition with both
definition = CustomTokenDefinition() \
    .add_token(minimal_token) \
    .add_token(full_token)

generator = create_token_generator(
    "my-hashing-secret",
    "my-32-character-encryption-key!",
    definition
)

result = generator.get_all_tokens(person_attrs)
print(f"Minimal Token: {result.tokens.get('MINIMAL')}")
print(f"Full Token: {result.tokens.get('FULL')}")
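
A rule with fewer attributes matches more records, including records that belong to different people. The plain-Python sketch below approximates the MINIMAL rule (last name plus first initial) to make that trade-off concrete. The helper `minimal_signature` is hypothetical and only mimics the expressions `T|U` and `T|S(0,1)|U`; it does not call OpenToken.

```python
def minimal_signature(person: dict) -> str:
    """Approximate the MINIMAL rule: upper-cased last name and first initial."""
    return "|".join([
        person["last_name"].strip().upper(),       # T|U
        person["first_name"].strip()[:1].upper(),  # T|S(0,1)|U
    ])


john = {"first_name": "John", "last_name": "Doe"}
jane = {"first_name": "Jane", "last_name": "Doe"}

print(minimal_signature(john))                             # DOE|J
print(minimal_signature(john) == minimal_signature(jane))  # True
```

Two different people collapse to the same value here, so a rule this loose is best combined with stricter tokens.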

## Summary

Three ways to create custom tokens:

1. **`quick_token()`** - Fastest, one-liner approach for simple cases
2. **`TokenBuilder`** - Fluent API for readable, flexible definitions
3. **Manual classes** - Full control, at the cost of more code (shown in the previous examples rather than in this notebook)

Choose the method that best fits your workflow!