# Edgy BAU Edge Processing Validation

This notebook validates the CashConnectedGraphChangeEvent generation for BAU (live) edge processing through the Edgy service by comparing the events generated against expected events computed from Duplograph's graph state.

## Overview

The validation process:
1. Takes a list of edges as input containing:
   - Source account (CASH_CUSTOMER) token
   - Target node type and token
   - Edge effective timestamp
2. For each edge:
   - Finds the source account holder
   - Queries their active (non-merged) accounts
   - Identifies other account holders connected to the same asset type via active (non-merged) accounts
   - Computes expected connection change events based on shared assets and Edgy's deduplication logic
   - Validates generated event IDs against expected events to ensure all expected events are present, and no unexpected events are generated
   - Validates the source and target user labels attached to the events
   - Validates the timestamp of the events matches the earliest connection between the source and target user
3. Performs general sanity checks on the generated events
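
The input described in step 1 can be pictured as a small record. The sketch below is illustrative only: `InputEdge` and the sample tokens are hypothetical and not part of the notebook. The `edge_id` layout mirrors the one `validate_bau_edge` builds further down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputEdge:
    """Hypothetical container for one input edge (not defined in the notebook)."""
    source_account: str   # CASH_CUSTOMER token
    asset_type: str       # target node type
    asset_token: str      # target node token
    effective_at_ms: int  # edge effective timestamp, epoch milliseconds

    @property
    def edge_id(self) -> str:
        # Same component order as the edge_id built in validate_bau_edge
        return f"{self.source_account}/{self.asset_token}/{self.asset_type}/{self.effective_at_ms}"


# Placeholder values, chosen only for illustration
edge = InputEdge("C_source", "DEVICE", "D_123", 1700000000000)
print(edge.edge_id)  # C_source/D_123/DEVICE/1700000000000
```

Because the identifier is later split on `/`, this layout assumes that none of the components contains a slash.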

## Setup

First, install required dependencies and import necessary modules.

In [0]:
%pip install --extra-index-url https://artifactory.global.square/artifactory/api/pypi/block-pypi/simple sq-pysnowflake==1.6.0

Python interpreter will be restarted.
Looking in indexes: https://pypi.org/simple, https://artifactory.global.square/artifactory/api/pypi/block-pypi/simple


In [0]:
import pandas as pd
import numpy as np
import csv
import fcntl
import json
import os
import time
import math
from datetime import datetime
from typing import List, Set, Dict, Tuple, Optional
from concurrent.futures import ThreadPoolExecutor
import concurrent.futures
from pyspark.sql.functions import col
from pysnowflake import Session
from IPython.display import display, HTML



## Helper functions

### Query helpers

In [0]:
warehouse_2xl = 'etl__2xlarge'
warehouse_large = 'etl__large'
warehouse_medium = 'etl__medium'

def snowflake_query(query: str) -> pd.DataFrame:
    """Execute a query in Snowflake and return results as a pandas DataFrame."""
    with Session(query_tag="ioanna", connection_override_args={'warehouse': warehouse_large}) as sess:
        cursor = sess.execute(query)
        return cursor.fetch_pandas_all()

def spark_query(query: str) -> pd.DataFrame:
    """Execute a query in Spark and return results as a pandas DataFrame."""
    return spark.sql(query).toPandas()
        
def to_sql_list(inputs):
    """Convert a set/list of values to SQL IN clause format"""
    if not inputs:
        return "()"
    return "('" + "','".join(str(x) for x in inputs) + "')"
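
`to_sql_list` interpolates values into SQL text without escaping them, which is fine for tokens that never contain a quote. As a hedged sketch, a variant that doubles embedded single quotes (the standard SQL escape) could look like the following. The name `to_sql_list_escaped` and the sample values are hypothetical.

```python
from typing import Iterable


def to_sql_list_escaped(inputs: Iterable[object]) -> str:
    """Hypothetical variant of to_sql_list that doubles embedded single quotes."""
    values = [str(x).replace("'", "''") for x in inputs]
    if not values:
        # Same result as to_sql_list for empty input; callers are still expected
        # to skip the query in that case, since "IN ()" is not valid SQL.
        return "()"
    return "('" + "','".join(values) + "')"


print(to_sql_list_escaped(["C_abc", "C_def"]))  # ('C_abc','C_def')
print(to_sql_list_escaped(["o'brien"]))         # ('o''brien')
```

The helper functions in this notebook return early when their input sets are empty, so the `"()"` case should not reach Snowflake.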

### Validation Result Class
Tracks validation outcomes and error details for each edge.

In [0]:
class ValidationResult:
    def __init__(self, edge_id: str):
        self.edge_id = edge_id
        self.validation_status: Optional[str] = None
        self.missing_events: Set[str] = set()
        self.unexpected_events: Set[str] = set()
        self.label_errors: List[str] = []
        self.timestamp_errors: List[str] = []
        self.success: bool = True
        
    def add_error(self, error_type: str, details: str):
        """Records an error of the specified type with details"""
        self.success = False
        if error_type == "validation_status":
            self.validation_status = details
        elif error_type == "missing_event":
            self.missing_events.add(details)
        elif error_type == "unexpected_event":
            self.unexpected_events.add(details)
        elif error_type == "label":
            self.label_errors.append(details)
        elif error_type == "timestamp":
            self.timestamp_errors.append(details)
        else:
            print(f"unknown error type={error_type!r}, details={details!r}")
            
    def get_summary(self) -> str:
        """Generates a human-readable summary of validation results"""
        summary = []
        
        if self.validation_status:
            summary.append(f"ℹ️ {self.edge_id}: {self.validation_status}")
        elif self.success:
            summary.append(f"✅ {self.edge_id}: All validations passed")
        else:
            summary.append(f"❌ {self.edge_id}: Validation failed")
        
        if self.missing_events:
            summary.append(f"\nMissing events ({len(self.missing_events)})")
            for event in sorted(self.missing_events):
                summary.append(f"  - {event}")
        if self.unexpected_events:
            summary.append(f"\nUnexpected events ({len(self.unexpected_events)})")
            for event in sorted(self.unexpected_events):
                summary.append(f"  - {event}")
        if self.label_errors:
            summary.append(f"\nLabel errors ({len(self.label_errors)})")
            for error in self.label_errors:
                summary.append(f"  - {error}")
        if self.timestamp_errors:
            summary.append(f"\nTimestamp errors ({len(self.timestamp_errors)})")
            for error in self.timestamp_errors:
                summary.append(f"  - {error}")
        return "\n".join(summary)
    
    def write_to_csv(self, csv_path: str):
        """Writes this validation result to CSV in a thread-safe manner.
        
        Args:
            csv_path: Path to the CSV file
        """
        # Extract edge components from the edge_id
        # Format is "{source_account}/{asset_token}/{asset_type}/{effective_at_ms}"
        source_account, asset_token, asset_type, effective_at_ms = self.edge_id.split('/')
                
        # Prepare the row data
        row_data = {
            'edge_id': self.edge_id,
            'source_account': source_account,
            'asset_token': asset_token,
            'asset_type': asset_type,
            'effective_at_ms': effective_at_ms,
            'validation_success': self.success,
            'validation_status': self.validation_status if self.validation_status else 
                            ('PASS' if self.success else 'FAIL'),
            'missing_events': '|'.join(sorted(self.missing_events)) if self.missing_events else '',
            'unexpected_events': '|'.join(sorted(self.unexpected_events)) if self.unexpected_events else '',
            'label_errors': '|'.join(self.label_errors) if self.label_errors else '',
            'timestamp_errors': '|'.join(self.timestamp_errors) if self.timestamp_errors else ''
        }
        
        # Define the field names (column headers)
        fieldnames = [
            'edge_id',
            'source_account',
            'asset_token',
            'asset_type',
            'effective_at_ms',
            'validation_success',
            'validation_status',
            'missing_events',
            'unexpected_events',
            'label_errors',
            'timestamp_errors'
        ]
        
        # Use file locking for thread-safe writing. Always open in append mode and
        # decide whether to write the header only once the lock is held; checking
        # for the file before locking lets two writers both see a missing file and
        # truncate each other's rows.
        with open(csv_path, 'a', newline='') as f:
            # Get an exclusive lock on the file
            fcntl.flock(f.fileno(), fcntl.LOCK_EX)
            try:
                writer = csv.DictWriter(f, fieldnames=fieldnames)
                
                # Write headers if the file is still empty
                if os.fstat(f.fileno()).st_size == 0:
                    writer.writeheader()
                
                # Write the row
                writer.writerow(row_data)
                
                # Ensure the write is flushed to disk
                f.flush()
                os.fsync(f.fileno())
            finally:
                # Release the lock
                fcntl.flock(f.fileno(), fcntl.LOCK_UN)

### Event ID generation

In [0]:
def connection_change_event_id(target_token: str, source_token: str, node_type: str) -> str:
    """Produces a ConnectionChange event_id in the same format as edgy
    
    The event ID format ensures consistency regardless of processing direction:
    - Format: "E/<token1>/<token2>/<nodeType>"
    - token1 and token2 are alphabetically sorted
    """
    tokens = "/".join(sorted([target_token, source_token]))
    return f"E/{tokens}/{node_type}"
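
A short illustration of why the sorted tokens matter: the same ID comes out regardless of which holder is treated as the source, so candidate events for the same pair collapse to one entry. The function is repeated here so the snippet runs on its own. The holder tokens, the `DEVICE` node type, and the timestamps are placeholders, and keeping the earliest candidate follows this notebook's expected-event computation.

```python
from datetime import datetime


def connection_change_event_id(target_token: str, source_token: str, node_type: str) -> str:
    tokens = "/".join(sorted([target_token, source_token]))
    return f"E/{tokens}/{node_type}"


forward = connection_change_event_id("AH_bbb", "AH_aaa", "DEVICE")
reverse = connection_change_event_id("AH_aaa", "AH_bbb", "DEVICE")
print(forward)             # E/AH_aaa/AH_bbb/DEVICE
print(forward == reverse)  # True

# Two candidate events for the same unordered pair of holders
candidates = [
    {"source": "AH_aaa", "target": "AH_bbb", "asset_type": "DEVICE",
     "earliest_connection_time": datetime(2024, 1, 8)},
    {"source": "AH_bbb", "target": "AH_aaa", "asset_type": "DEVICE",
     "earliest_connection_time": datetime(2024, 1, 3)},
]

# Keep one event per ID, preferring the earliest connection time
deduped = {}
for evt in candidates:
    eid = connection_change_event_id(evt["target"], evt["source"], evt["asset_type"])
    if eid not in deduped or evt["earliest_connection_time"] < deduped[eid]["earliest_connection_time"]:
        deduped[eid] = evt

print(len(deduped))  # 1
```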

### Label validation
Functions to validate label propagation in events.

In [0]:
def validate_labels(
    ccgce_df: pd.DataFrame,
    source_labels_df: pd.DataFrame,
    target_labels_df: pd.DataFrame,
    source_account_holder: str,
    validation_result: ValidationResult
) -> None:
    """Validates that labels are correctly propagated in CashConnectedGraphChangeEvents.
    
    This function checks:
    1. Source labels in events match the union of labels from all source account holder's accounts
    2. Target labels in events match the union of labels from all target account holder's accounts
    
    Note: ACCOUNT_DENYLISTED labels are excluded as they apply to accounts, not account holders.
    
    Args:
        ccgce_df: DataFrame containing the CashConnectedGraphChangeEvents
        source_labels_df: DataFrame with source account holder's label details from all their accounts
        target_labels_df: DataFrame with target account holders' label details from all their accounts
        source_account_holder: The source account holder token being validated
        validation_result: ValidationResult object to store any errors
        
    The label DataFrames should contain:
        - ACCOUNT_HOLDER_TOKEN: The account holder
        - ACCOUNT_TOKEN: The account with the label
        - LABEL: The label name
        - LABEL_EFFECTIVE_AT: When the label became effective
        - ACCOUNT_MAPPING_EFFECTIVE_AT: When the account was mapped to the holder
    """
    # Get source account holder's active labels, excluding ACCOUNT_DENYLISTED
    source_labels = set()
    if not source_labels_df.empty:
        source_labels = set(
            source_labels_df[source_labels_df['LABEL'] != 'ACCOUNT_DENYLISTED']['LABEL'].tolist()
        )

    # Create mapping of target account holders to their active labels
    target_holder_labels = {}
    if not target_labels_df.empty:
        for account_holder in target_labels_df['ACCOUNT_HOLDER_TOKEN'].unique():
            holder_labels = target_labels_df[
                (target_labels_df['ACCOUNT_HOLDER_TOKEN'] == account_holder) &
                (target_labels_df['LABEL'] != 'ACCOUNT_DENYLISTED')
            ]['LABEL'].tolist()
            target_holder_labels[account_holder] = set(holder_labels)

    # Validate each event
    for _, event in ccgce_df.iterrows():
        # Verify this is the correct source account holder
        if event['source_user_token'] != source_account_holder:
            validation_result.add_error(
                "label",
                f"Event {event['event_id']} has unexpected source_user_token: "
                f"{event['source_user_token']} (expected {source_account_holder})"
            )
            continue

        event_source_labels = set(event['source_user_labels'])
        event_target_labels = set(event['target_user_labels'])
        target_holder = event['target_user_token']
        expected_target_labels = target_holder_labels.get(target_holder, set())

        # Validate source labels
        if event_source_labels != source_labels:
            validation_result.add_error(
                "label",
                f"Source label mismatch for event {event['event_id']}:\n"
                f"  Expected: {sorted(source_labels)}\n"
                f"  Actual: {sorted(event_source_labels)}\n"
                f"  Source label details:\n{source_labels_df.to_string()}"
            )

        # Validate target labels
        if event_target_labels != expected_target_labels:
            target_df_for_holder = target_labels_df[
                target_labels_df['ACCOUNT_HOLDER_TOKEN'] == target_holder
            ]
            validation_result.add_error(
                "label",
                f"Target label mismatch for event {event['event_id']}:\n"
                f"  Target account holder: {target_holder}\n"
                f"  Expected: {sorted(expected_target_labels)}\n"
                f"  Actual: {sorted(event_target_labels)}\n"
                f"  Target label details for this holder:\n"
                f"{target_df_for_holder.to_string()}"
            )
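
The expected label set for a holder is the union of labels across that holder's accounts, minus `ACCOUNT_DENYLISTED`. A minimal sketch of that rule follows, using plain Python rows in place of the label DataFrame. The tokens and the `LABEL_A` / `LABEL_B` names are made up for illustration.

```python
from collections import defaultdict

# (ACCOUNT_HOLDER_TOKEN, ACCOUNT_TOKEN, LABEL) rows; all values are placeholders
label_rows = [
    ("AH_1", "C_1", "LABEL_A"),
    ("AH_1", "C_2", "LABEL_B"),
    ("AH_1", "C_2", "ACCOUNT_DENYLISTED"),  # account-level label, excluded
    ("AH_2", "C_3", "LABEL_A"),
]

# Union of labels per account holder, skipping the account-level label
expected_labels = defaultdict(set)
for holder, _account, label in label_rows:
    if label != "ACCOUNT_DENYLISTED":
        expected_labels[holder].add(label)

print(sorted(expected_labels["AH_1"]))  # ['LABEL_A', 'LABEL_B']
print(sorted(expected_labels["AH_2"]))  # ['LABEL_A']
```

An event passes label validation when its `source_user_labels` and `target_user_labels`, compared as sets, equal these unions.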

### Timestamp Validation
Functions to validate event timestamps.

In [0]:
def validate_timestamps(
    ccgce_df: pd.DataFrame,
    asset_connections: pd.DataFrame, 
    validation_result: ValidationResult
):
    """Validates that event timestamps match the earliest effective connection time.
    
    The effective connection time between two account holders via an asset is the
    later of:
    1. When the source account connected to the asset
    2. When the target account connected to the asset
    
    Args:
        ccgce_df: DataFrame containing the CashConnectedGraphChangeEvents
        asset_connections: DataFrame with asset connection details including:
            - ACCOUNT_TOKEN: The connected account
            - ASSET_TOKEN: The shared asset
            - ASSET_TYPE: Type of the shared asset
            - ASSET_CONNECTION_TIME: When the account connected to the asset
            - SOURCE_CONNECTION_TIME: When the source account connected to the asset
            - ACCOUNT_HOLDER_TOKEN: The account holder owning the connected account
            - EARLIEST_CONNECTION_TIME: Per account holder and asset type, the earliest
              value of the later of the source and target connection times
        validation_result: ValidationResult object to store any errors
    """
    for _, event in ccgce_df.iterrows():
        target_holder = event['target_user_token']
        source_holder = event['source_user_token']
        asset_type = event['changed_node_type']

        # Find connections for this target account holder via this asset type
        target_connections = asset_connections[
            (asset_connections['ACCOUNT_HOLDER_TOKEN'] == target_holder) &
            (asset_connections['ASSET_TYPE'] == asset_type)
        ]

        if target_connections.empty:
            validation_result.add_error(
                "timestamp",
                f"No shared assets found for connection between {source_holder} and {target_holder} via {asset_type}"
            )
            continue

        # The earliest_connection_time is already the later of source and target connection times
        expected_timestamp = pd.to_datetime(target_connections['EARLIEST_CONNECTION_TIME'].min())
        actual_timestamp = pd.to_datetime(event['effective_at_millis'], unit='ms')
        
        if actual_timestamp != expected_timestamp:
            validation_result.add_error(
                "timestamp",
                f"Timestamp mismatch for event {event['event_id']}:\n"
                f"  Expected (earliest effective connection): {expected_timestamp}\n"
                f"  Actual (event effective_at): {actual_timestamp}"
            )
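
As a worked example of the expected timestamp: for each shared asset the effective connection time is the later of the two sides' connection times, and the event should carry the earliest of those values. The asset names and dates below are invented for illustration.

```python
from datetime import datetime

# (asset_token, source_connection_time, target_connection_time); placeholder data
shared_assets = [
    ("asset_1", datetime(2024, 1, 10), datetime(2024, 1, 5)),
    ("asset_2", datetime(2024, 1, 3), datetime(2024, 1, 8)),
]

# Per asset: the holders are only connected once BOTH sides are attached
effective_times = {
    asset: max(source_time, target_time)
    for asset, source_time, target_time in shared_assets
}

# Across assets: the connection exists from the earliest effective time
expected_timestamp = min(effective_times.values())

print(effective_times["asset_1"])  # 2024-01-10 00:00:00
print(effective_times["asset_2"])  # 2024-01-08 00:00:00
print(expected_timestamp)          # 2024-01-08 00:00:00
```

This is the same `MIN(GREATEST(...))` computation that the Snowflake query performs when it derives `EARLIEST_CONNECTION_TIME`.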

### Account holder functions

In [0]:
def get_account_holder_for_account(
    account_token: str,
    effective_at: str
) -> Optional[str]:
    """Gets the account holder for a CASH_CUSTOMER account.
    
    Args:
        account_token: The CASH_CUSTOMER token to find the account holder for
        effective_at: Timestamp for point-in-time query
        
    Returns:
        The ACCOUNT_HOLDER_CASH_CUSTOMER token or None if not found
    """
    query = f"""
    SELECT FROM_TOKEN as ACCOUNT_HOLDER_TOKEN
    FROM duplograph.public.edges
    WHERE FROM_TYPE = 'ACCOUNT_HOLDER_CASH_CUSTOMER'
    AND TO_TYPE = 'CASH_CUSTOMER'
    AND TO_TOKEN = '{account_token}'
    AND EFFECTIVE_AT <= '{effective_at}'
    """
    
    df = snowflake_query(query)
    return df['ACCOUNT_HOLDER_TOKEN'].iloc[0] if not df.empty else None


In [0]:
def get_all_active_accounts(
    account_holders: Set[str],
    effective_at: str
) -> Dict[str, Set[str]]:
    """Gets all active (non-merged) accounts for a set of account holders.
    
    Args:
        account_holders: Set of account holder tokens to get accounts for
        effective_at: Timestamp for point-in-time query
        
    Returns:
        Dictionary mapping account holder tokens to their set of active (non-merged) account tokens.
        Account holders with no active accounts will have empty sets or be missing from the dictionary.
    """
    if not account_holders:
        return {}
        
    query = f"""
    WITH account_edges AS (
        -- Get all account mappings for the account holders
        SELECT 
            FROM_TOKEN AS account_holder_token,
            TO_TOKEN AS account_token,
            EFFECTIVE_AT AS mapping_effective_at
        FROM duplograph.public.edges
        WHERE FROM_TYPE = 'ACCOUNT_HOLDER_CASH_CUSTOMER'
        AND TO_TYPE = 'CASH_CUSTOMER'
        AND FROM_TOKEN IN {to_sql_list(account_holders)}
        AND EFFECTIVE_AT <= '{effective_at}'
    )
    SELECT 
        ae.account_holder_token,
        ae.account_token
    FROM account_edges ae
    WHERE NOT EXISTS (
        -- Exclude merged accounts
        SELECT 1
        FROM duplograph.public.edges m
        WHERE m.FROM_TOKEN = ae.account_token
        AND m.FROM_TYPE = 'CASH_CUSTOMER'
        AND m.TO_TYPE = 'CASH_CUSTOMER'
        AND m.EFFECTIVE_AT <= '{effective_at}'
    )
    """
    
    df = snowflake_query(query)
    
    # Convert to dictionary mapping account holders to their account sets
    result = {}
    if not df.empty:
        for account_holder in account_holders:
            accounts = df[df['ACCOUNT_HOLDER_TOKEN'] == account_holder]['ACCOUNT_TOKEN'].tolist()
            result[account_holder] = set(accounts)
    
    return result
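
The query's notion of "active" can be sketched in plain Python: an account counts as merged when it is the `FROM` side of a `CASH_CUSTOMER` to `CASH_CUSTOMER` edge, and only unmerged accounts stay attached to their holder. The edge tuples below are placeholders, and the point-in-time filter on `EFFECTIVE_AT` is omitted for brevity.

```python
from collections import defaultdict

# (FROM_TYPE, FROM_TOKEN, TO_TYPE, TO_TOKEN); placeholder edges
edges = [
    ("ACCOUNT_HOLDER_CASH_CUSTOMER", "AH_1", "CASH_CUSTOMER", "C_1"),
    ("ACCOUNT_HOLDER_CASH_CUSTOMER", "AH_1", "CASH_CUSTOMER", "C_2"),
    ("CASH_CUSTOMER", "C_2", "CASH_CUSTOMER", "C_1"),  # marks C_2 as merged
]

# Accounts with an outgoing account-to-account edge are treated as merged
merged_accounts = {
    from_token
    for from_type, from_token, to_type, _ in edges
    if from_type == "CASH_CUSTOMER" and to_type == "CASH_CUSTOMER"
}

# Keep only holder -> account mappings whose account is not merged
active_accounts = defaultdict(set)
for from_type, from_token, to_type, to_token in edges:
    if (
        from_type == "ACCOUNT_HOLDER_CASH_CUSTOMER"
        and to_type == "CASH_CUSTOMER"
        and to_token not in merged_accounts
    ):
        active_accounts[from_token].add(to_token)

print(dict(active_accounts))  # {'AH_1': {'C_1'}}
```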

In [0]:
def get_connected_account_holders(
    source_holder: str,
    source_accounts: Set[str],
    effective_at: str,
    asset_types: Optional[Set[str]] = None
) -> Tuple[pd.DataFrame, pd.DataFrame, Set[str]]:
    """Gets account holders connected to source accounts via shared assets.
    
    Args:
        source_holder: The source account holder token
        source_accounts: Set of active account tokens belonging to the source account holder
        effective_at: Timestamp for point-in-time query
        asset_types: Optional set of asset types to filter by
        
    Returns:
        Tuple of (
            asset_connections: DataFrame with asset connection details including:
                - ACCOUNT_TOKEN: The connected account
                - ASSET_TOKEN: The shared asset
                - ASSET_TYPE: Type of the shared asset
                - ASSET_CONNECTION_TIME: When the account connected to the asset
                - SOURCE_CONNECTION_TIME: When the source account connected to the asset
                - ACCOUNT_HOLDER_TOKEN: The account holder owning the connected account
                - EARLIEST_CONNECTION_TIME: Per account holder and asset type, the earliest
                  value of the later of the source and target connection times
            holder_mapping: DataFrame mapping account holders to their connected accounts:
                - ACCOUNT_HOLDER_TOKEN: The account holder
                - ACCOUNT_TOKEN: Their account that shares an asset
                - MAPPING_EFFECTIVE_AT: When the account was mapped to the holder
            connected_holders: Set of account holder tokens that share assets with the source
        )
    """
    if not source_accounts:
        return pd.DataFrame(), pd.DataFrame(), set()
        
    asset_type_clause = "1=1"
    if asset_types:
        asset_type_clause = f"TO_TYPE IN {to_sql_list(asset_types)}"
        
    query = f"""
    WITH source_assets AS (
        -- Get assets connected to source accounts
        SELECT DISTINCT 
            TO_TOKEN,
            TO_TYPE,
            MIN(EFFECTIVE_AT) as SOURCE_CONNECTION_TIME  -- When source first connected
        FROM duplograph.public.edges
        WHERE FROM_TYPE = 'CASH_CUSTOMER'
        AND FROM_TOKEN IN {to_sql_list(source_accounts)}
        AND TO_TYPE != 'CASH_CUSTOMER'
        AND {asset_type_clause}
        AND EFFECTIVE_AT <= '{effective_at}'
        GROUP BY TO_TOKEN, TO_TYPE
    ),
    connected_accounts AS (
        -- Get accounts connected to those assets, excluding source accounts
        SELECT DISTINCT
            e.FROM_TOKEN AS ACCOUNT_TOKEN,
            e.TO_TOKEN AS ASSET_TOKEN,
            e.TO_TYPE AS ASSET_TYPE,
            MIN(e.EFFECTIVE_AT) AS ASSET_CONNECTION_TIME,  -- When target first connected
            MIN(sa.SOURCE_CONNECTION_TIME) AS SOURCE_CONNECTION_TIME
        FROM duplograph.public.edges e
        JOIN source_assets sa ON e.TO_TOKEN = sa.TO_TOKEN 
            AND e.TO_TYPE = sa.TO_TYPE
        WHERE e.FROM_TYPE = 'CASH_CUSTOMER'
        AND e.FROM_TOKEN NOT IN {to_sql_list(source_accounts)}
        AND e.EFFECTIVE_AT <= '{effective_at}'
        GROUP BY e.FROM_TOKEN, e.TO_TOKEN, e.TO_TYPE
    ),
    account_holders AS (
        -- Get account holder relationships for connected accounts
        SELECT 
            e.FROM_TOKEN AS ACCOUNT_HOLDER_TOKEN,
            e.TO_TOKEN AS ACCOUNT_TOKEN,
            e.EFFECTIVE_AT AS MAPPING_EFFECTIVE_AT
        FROM duplograph.public.edges e
        WHERE e.FROM_TYPE = 'ACCOUNT_HOLDER_CASH_CUSTOMER'
        AND e.TO_TYPE = 'CASH_CUSTOMER'
        AND e.TO_TOKEN IN (SELECT ACCOUNT_TOKEN FROM connected_accounts)
        AND e.FROM_TOKEN != '{source_holder}'
        AND e.EFFECTIVE_AT <= '{effective_at}'
    ),
    connections AS (
        -- Join with account holders and get earliest connection times
        SELECT 
            ca.*,
            ah.ACCOUNT_HOLDER_TOKEN,
            ah.MAPPING_EFFECTIVE_AT,
            -- For each account holder, find the earliest effective connection time
            -- (the later of source connection and target connection)
            MIN(GREATEST(ca.ASSET_CONNECTION_TIME, ca.SOURCE_CONNECTION_TIME)) 
                OVER (PARTITION BY ah.ACCOUNT_HOLDER_TOKEN, ca.ASSET_TYPE) AS EARLIEST_CONNECTION_TIME
        FROM connected_accounts ca
        JOIN account_holders ah ON ca.ACCOUNT_TOKEN = ah.ACCOUNT_TOKEN
    )
    SELECT * FROM connections
    ORDER BY ACCOUNT_HOLDER_TOKEN, ASSET_TOKEN
    """
    
    df = snowflake_query(query)
    
    if df.empty:
        return df, pd.DataFrame(), set()
        
    # Extract unique account holder tokens
    connected_holders = set(df['ACCOUNT_HOLDER_TOKEN'].unique())
    
    # Create account holder mapping DataFrame
    holder_mapping = df[['ACCOUNT_HOLDER_TOKEN', 'ACCOUNT_TOKEN', 'MAPPING_EFFECTIVE_AT']].drop_duplicates()
    
    # Create asset connections DataFrame
    asset_connections = df[[
        'ACCOUNT_TOKEN', 'ASSET_TOKEN', 'ASSET_TYPE', 'ASSET_CONNECTION_TIME',
        'SOURCE_CONNECTION_TIME', 'ACCOUNT_HOLDER_TOKEN', 'EARLIEST_CONNECTION_TIME'
    ]].drop_duplicates()
    
    return asset_connections, holder_mapping, connected_holders

In [0]:
def get_account_holder_labels(
    account_holder_accounts: Dict[str, Set[str]],
    effective_at: str
) -> pd.DataFrame:
    """Gets active labels for account holders via their accounts.
    
    Args:
        account_holder_accounts: Dict mapping account holder tokens to their active account tokens
        effective_at: Timestamp for point-in-time query
        
    Returns:
        DataFrame with label information
    """
    # Flatten account tokens for the query
    all_accounts = {
        account 
        for accounts in account_holder_accounts.values() 
        for account in accounts
    }
    
    if not all_accounts:
        return pd.DataFrame(columns=[
            'ACCOUNT_HOLDER_TOKEN', 'ACCOUNT_TOKEN', 'LABEL',
            'LABEL_EFFECTIVE_AT', 'ACCOUNT_MAPPING_EFFECTIVE_AT'
        ])

    query = f"""
    WITH latest_label_events AS (
        SELECT node_token,
               label,
               present,
               effective_at,
               ROW_NUMBER() OVER (PARTITION BY node_token, label ORDER BY effective_at DESC) as rn
        FROM duplograph.public.label_events
        WHERE node_type = 'CASH_CUSTOMER'
        AND effective_at <= '{effective_at}'
        AND node_token IN {to_sql_list(all_accounts)}
    )
    SELECT 
        m.from_token AS ACCOUNT_HOLDER_TOKEN,
        m.to_token AS ACCOUNT_TOKEN,
        l.label AS LABEL,
        l.effective_at AS LABEL_EFFECTIVE_AT,
        m.effective_at AS ACCOUNT_MAPPING_EFFECTIVE_AT
    FROM latest_label_events l
    JOIN duplograph.public.edges m 
        ON l.node_token = m.to_token
    WHERE l.rn = 1 
    AND l.present = true
    AND m.from_type = 'ACCOUNT_HOLDER_CASH_CUSTOMER'
    AND m.to_type = 'CASH_CUSTOMER'
    AND m.effective_at <= '{effective_at}'
    AND m.from_token IN {to_sql_list(account_holder_accounts.keys())}
    ORDER BY 
        m.from_token,
        m.to_token,
        l.label
    """
    
    return snowflake_query(query)
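
The `ROW_NUMBER()` step in the query keeps only the most recent label event per account and label, and a label counts as active when that latest event has `present = true`. A rough Python equivalent is sketched below. `LabelEvent`, `active_labels`, and the sample data are hypothetical names introduced for illustration.

```python
from datetime import datetime
from typing import Iterable, NamedTuple, Set, Tuple


class LabelEvent(NamedTuple):
    node_token: str
    label: str
    present: bool
    effective_at: datetime


def active_labels(events: Iterable[LabelEvent], as_of: datetime) -> Set[Tuple[str, str]]:
    """Return (node_token, label) pairs whose latest event at or before as_of is present."""
    latest = {}
    for event in events:
        if event.effective_at > as_of:
            continue  # point-in-time filter: ignore events after the cutoff
        key = (event.node_token, event.label)
        if key not in latest or event.effective_at > latest[key].effective_at:
            latest[key] = event
    return {key for key, event in latest.items() if event.present}


events = [
    LabelEvent("C_1", "LABEL_A", True, datetime(2024, 1, 1)),
    LabelEvent("C_1", "LABEL_A", False, datetime(2024, 1, 5)),  # label removed
    LabelEvent("C_1", "LABEL_B", True, datetime(2024, 1, 2)),
    LabelEvent("C_1", "LABEL_C", True, datetime(2024, 1, 20)),  # after the cutoff
]
result = active_labels(events, as_of=datetime(2024, 1, 10))
print(result)  # {('C_1', 'LABEL_B')}
```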

## Main Validation Logic
Functions to validate BAU edge processing.

In [0]:
def validate_no_events_condition(
    edge_id: str,
    actual_events: Set[str],
    status_message: str
) -> ValidationResult:
    """Validates cases where no events should be present.
    
    Args:
        edge_id: Identifier of the edge being validated ("{source_account}/{asset_token}/{asset_type}/{effective_at_ms}")
        actual_events: Set of actual event IDs found
        status_message: Status message explaining why no events are expected
        
    Returns:
        ValidationResult: Result object with appropriate status and any unexpected events
    """
    result = ValidationResult(edge_id)
    
    if actual_events:
        # There are events when there shouldn't be any
        result.add_error("validation_status", status_message)
        for event_id in actual_events:
            result.add_error("unexpected_event", event_id)
    else:
        # This is correct - no events when none are expected
        result.success = True
        result.validation_status = status_message
    
    return result

In [0]:
def validate_bau_edge(
    source_account: str,
    asset_type: str,
    asset_token: str,
    effective_at_ms: int,
    ccgce_table: str
) -> ValidationResult:
    """
    Validates CashConnectedGraphChangeEvents for a BAU (Business-As-Usual) edge,
    following Edgy's Java implementation order, using the provided helper function
    `connection_change_event_id` for event IDs.

    Steps:
    1. Look up source account holder and their active accounts as of given timestamp.
    2. Fetch full asset "neighborhood" of connected holders & accounts.
    3. Generate all candidate connection change events.
    4. Deduplicate (unordered account holder pairs using connection_change_event_id).
    5. Filter for events triggered by the given edge.
    6. Remove any events where the target account holder has only merged/inactive accounts.
    7. Compare produced vs. expected events and report discrepancies, always reporting
       on unexpected events using validate_no_events_condition as in the original logic.

    Args:
        source_account: The source account's CASH_CUSTOMER token.
        asset_type: The node type of the asset the account connects to.
        asset_token: The node token (asset) being connected.
        effective_at_ms: Edge effective timestamp in epoch milliseconds.
        ccgce_table: The table containing CashConnectedGraphChangeEvents.

    Returns:
        ValidationResult summarizing any validation errors found.
    """
    edge_id = f"{source_account}/{asset_token}/{asset_type}/{effective_at_ms}"
    result = ValidationResult(edge_id)
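    # Interpret the epoch milliseconds as UTC rather than local time, matching the
    # pd.to_datetime(..., unit='ms') conversions used for event timestamps below
    effective_at = pd.to_datetime(effective_at_ms, unit='ms')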
    effective_at_str = effective_at.strftime('%Y-%m-%d %H:%M:%S.%f')

    # Fetch ConnectionChange events that have this edge as their source event
    actual_events_query = f"""
        SELECT
            cash_connected_graph_change_event.event_id as event_id,
            cash_connected_graph_change_event.event_type as event_type,
            cash_connected_graph_change_event.target_user_token as target_user_token,
            cash_connected_graph_change_event.source_user_token as source_user_token,
            cash_connected_graph_change_event.effective_at_millis as effective_at_millis,
            cash_connected_graph_change_event.published_at_millis as published_at_millis,
            cash_connected_graph_change_event.user_type as user_type,
            cash_connected_graph_change_event.event_source_type as event_source_type,
            cash_connected_graph_change_event.connection_change.changed_node_type as changed_node_type,
            cash_connected_graph_change_event.connection_change.source_user_labels as source_user_labels,
            cash_connected_graph_change_event.connection_change.target_user_labels as target_user_labels        
        FROM {ccgce_table}
        WHERE cash_connected_graph_change_event.connection_change.source_edge.from_node.token = '{source_account}'
        AND cash_connected_graph_change_event.connection_change.source_edge.to_node.type = '{asset_type}'
        AND cash_connected_graph_change_event.connection_change.source_edge.to_node.token = '{asset_token}'
        AND cash_connected_graph_change_event.connection_change.source_edge.effective_at_msec = '{effective_at_ms}'
    """
    actual_events_df = spark_query(actual_events_query)
    actual_event_id_set = set(actual_events_df['event_id'])

    # 1. Lookup source account holder and their active accounts
    source_account_holder = get_account_holder_for_account(source_account, effective_at_str)

    # If no source account holder, there should be no events
    if not source_account_holder:
        return validate_no_events_condition(edge_id, actual_event_id_set, "No account holder found for source account")
    
    # Get all active accounts for the source account holder
    source_active_accounts = get_all_active_accounts({source_account_holder}, effective_at_str).get(source_account_holder, set())
    if not source_active_accounts:
        return validate_no_events_condition(
            edge_id, actual_event_id_set, "Account holder has no active accounts. No change events expected."
        )

    # 2. Fetch full asset neighborhood (all connected holders & their accounts for asset_type)
    asset_connections, holder_mapping, connected_holders = get_connected_account_holders(
        source_account_holder, source_active_accounts, effective_at_str, {asset_type}
    )
    if asset_connections.empty:
        return validate_no_events_condition(edge_id, actual_event_id_set, "No shared assets found")

    # 3. Generate all candidate events for every connected account holder (excluding self)
    candidate_events = []
    for _, row in asset_connections.iterrows():
        target_holder = row['ACCOUNT_HOLDER_TOKEN']
        asset_tok = row['ASSET_TOKEN']
        connect_time = pd.to_datetime(row['EARLIEST_CONNECTION_TIME'])
        if target_holder != source_account_holder:
            evt = {
                "source": source_account_holder,
                "target": target_holder,
                "asset_token": asset_tok,
                "asset_type": asset_type,
                "earliest_connection_time": connect_time
            }
            candidate_events.append(evt)

    # 4. Deduplicate
    deduped = {}
    for evt in candidate_events:
        eid = connection_change_event_id(evt["source"], evt["target"], evt["asset_type"])
        if eid not in deduped or evt["earliest_connection_time"] < deduped[eid]["earliest_connection_time"]:
            deduped[eid] = evt

    # 5. Filter for events triggered by this edge at this asset and time
    edge_timestamp = pd.to_datetime(effective_at_ms, unit='ms')
    relevant_events = {}
    for evt in deduped.values():
        if (
            evt["earliest_connection_time"] == edge_timestamp
            and evt["asset_token"] == asset_token
        ):
            eid = connection_change_event_id(evt["source"], evt["target"], evt["asset_type"])
            relevant_events[eid] = evt

    # 6. Remove events where the target account holder is merged/inactive (batch for efficiency)
    targets_to_check = {evt["target"] for evt in relevant_events.values()}
    targets_active_accounts = get_all_active_accounts(targets_to_check, effective_at_str)
    
    target_active_holders = {
        holder for holder, accounts in targets_active_accounts.items()
        if accounts
    }
    
    if not target_active_holders:
        return validate_no_events_condition(
            edge_id, actual_event_id_set, "No target account holders with active accounts found")

    final_event_ids = set()
    for eid, evt in relevant_events.items():
        target_active = targets_active_accounts.get(evt["target"], set())
        if target_active:
            final_event_ids.add(eid)

    # 7. Compare: handle both the case of missing/extra events and when no events are expected
    if not final_event_ids:
        # If no events are expected, validate that none were output
        return validate_no_events_condition(edge_id, actual_event_id_set, "No change events expected")

    missing_events = final_event_ids - actual_event_id_set
    unexpected_events = actual_event_id_set - final_event_ids

    if missing_events:
        for eid in sorted(missing_events):
            result.add_error("missing_event", f"Expected but not found: {eid}")
    if unexpected_events:
        for eid in sorted(unexpected_events):
            result.add_error("unexpected_event", f"Produced by Edgy but not expected: {eid}")

    # Create complete mapping including both source and target account holders
    all_account_mapping = {
        source_account_holder: source_active_accounts,
        **targets_active_accounts
    }

    # Get all labels in a single query using the complete account mapping
    source_and_target_labels = get_account_holder_labels(
        all_account_mapping, effective_at
    )
    
    source_labels = source_and_target_labels[
        source_and_target_labels['ACCOUNT_HOLDER_TOKEN'] == source_account_holder
    ]
    target_labels = source_and_target_labels[
        source_and_target_labels['ACCOUNT_HOLDER_TOKEN'].isin(target_active_holders)
    ]
    
    # Validate timestamps and labels
    validate_timestamps(
        ccgce_df=actual_events_df,
        asset_connections=asset_connections,
        validation_result=result
    )

    validate_labels(
        ccgce_df=actual_events_df,
        source_labels_df=source_labels,
        target_labels_df=target_labels,
        source_account_holder=source_account_holder,
        validation_result=result
    )

    return result

In [0]:
def validate_edges(
    edges: List[Tuple[str, str, str, int]],
    ccgce_table: str,
    output_dir: Optional[str] = None,
    csv_path: Optional[str] = None
) -> List[ValidationResult]:
    """Validates edgy BAU edge processing for multiple edges.
    
    Args:
        edges: List of (from_token, to_token, to_type, effective_at) tuples to validate
        ccgce_table: The table containing CashConnectedGraphChangeEvents
        output_dir: Optional directory for output files. If not provided,
                   defaults to current directory
        csv_path: Optionally set the output csv file explicitly. If not
                  provided, it will be auto-generated
    
    Returns:
        List of ValidationResult objects, one per edge
    """
    num_edges = len(edges)
    print(f"Validating {num_edges} edges:")
    for edge in edges:
        print(f"- {edge[0]} -> {edge[1]} ({edge[2]}) @ {edge[3]}")
    
    # Create output directory if it doesn't exist
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)
    else:
        output_dir = '.'
        
    # Create CSV filename with timestamp
    if csv_path is None:
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        csv_path = os.path.join(output_dir, f'edge_validation_results_{timestamp}.csv')
    
    results = []
    timings = {}
    errors = []
    
    for i, edge in enumerate(edges):
        from_token, to_token, to_type, effective_at_ms = edge
        edge_id = f"{from_token}/{to_token}/{to_type}/{effective_at_ms}"
        effective_at_str = pd.to_datetime(effective_at_ms, unit='ms')
        print(f"\nProcessing event {i+1}/{num_edges}: {edge_id}, effective_at={effective_at_str}")

        start_time = time.time()
        
        try:
            result = validate_bau_edge(
                source_account=from_token,
                asset_token=to_token,
                asset_type=to_type,
                effective_at_ms=effective_at_ms,
                ccgce_table=ccgce_table
            )
            results.append(result)
            
            # Write result to CSV as soon as it's available
            result.write_to_csv(csv_path)
            
            end_time = time.time()
            timings[edge_id] = end_time - start_time
            
            # Print progress with timing and detailed results
            duration = timings[edge_id]
            print(f"\nTime to validate {edge_id}: {duration:.2f} seconds")
            print(result.get_summary())
            print()
        except Exception as e:
            error_msg = f"ERROR while validating event {edge_id}: {type(e).__name__}: {e}"
            print(error_msg)
            errors.append((edge_id, e))
            # Optionally: Keep running or re-raise depending on workflow; here we keep running next events
            
    # Print summary statistics
    success_count = sum(1 for r in results if r.success)
    print(f"\nValidation complete: {success_count}/{len(results)} passed")
    
    if errors:
        print(f"\n{len(errors)} events failed due to code errors:")
        for edge_id, err in errors:
            print(f"  {edge_id}: {type(err).__name__}: {err}")

    # Timing statistics (skipped when no edge completed, to avoid dividing by zero)
    if timings:
        avg_time = sum(timings.values()) / len(timings)
        max_time = max(timings.values())
        min_time = min(timings.values())
        print("\nTiming statistics:")
        print(f"- Average time per edge: {avg_time:.2f} seconds")
        print(f"- Fastest validation: {min_time:.2f} seconds")
        print(f"- Slowest validation: {max_time:.2f} seconds")
        print(f"- Total validation time: {sum(timings.values()):.2f} seconds")
    print(f"\nResults written to: {csv_path}")
    
    return results

In [0]:
def validate_edges_batch(
    edges: Set[Tuple[str, str, str, int]],
    ccgce_table: str,
    output_dir: str,
    batch_size: int = 5,
    max_workers: int = 3
) -> str:
    """Validates edges in parallel batches.
    
    Args:
        edges: Set of (from_token, to_token, to_type, effective_at) tuples to validate
        ccgce_table: The table containing CashConnectedGraphChangeEvents
        output_dir: Directory for output files
        batch_size: Number of edges to process in each batch
        max_workers: Maximum number of parallel threads
    
    Returns:
        String with the path to the csv file containing the validation results
    """
    # Create output directory
    os.makedirs(output_dir, exist_ok=True)
    
    # Create a single CSV filename for all batches
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    csv_path = os.path.join(output_dir, f'bau_edge_validation_{timestamp}.csv')
    
    # Split edges into batches
    edges_list = list(edges)
    num_total_edges = len(edges_list)
    num_batches = math.ceil(num_total_edges / batch_size)
    batches = [
        edges_list[i * batch_size:(i + 1) * batch_size]
        for i in range(num_batches)
    ]
    
    print(f"Processing {num_total_edges} edges in {num_batches} batches "
          f"of size {batch_size} using {max_workers} workers")
    print(f"Results will be written to: {csv_path}")
    
    start_time = time.time() 

    # Process batches in parallel
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
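        # Map each future to its 1-based batch number so log lines identify the right batch
        future_to_batch = {
            executor.submit(
                validate_edges,
                batch,
                ccgce_table,
                output_dir,
                csv_path
            ): batch_num
            for batch_num, batch in enumerate(batches, start=1)
        }
        
        # Process results as they complete (completion order can differ from submission order)
        for future in concurrent.futures.as_completed(future_to_batch):
            batch_num = future_to_batch[future]
            try:
                future.result()
                print(f"\nCompleted batch {batch_num}/{num_batches}")
            except Exception as e:
                print(f"\nBatch {batch_num} failed with error: {e}")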
    
    end_time = time.time()
    print(f"\nAll batches processed. Total time: {end_time - start_time:.2f} seconds.")
    print(f"Consolidated results are in: {csv_path}")
    
    return csv_path

## Validation Execution

In [0]:
# Table where edgy's output is loaded:
#ccgce_table = "cash_banking_ml_eng.cash_connected_graph_change_event.edgy_test_20250512"
#ccgce_table = "cash_banking_ml_eng.cash_connected_graph_change_event.edgy_test_20250610"
ccgce_table = "cash_banking_ml_eng.cash_connected_graph_change_event.edgy_test_20250626"

In [0]:
# Obtain the date range of BAU events in ccgce_table that can be meaningfully validated 
query = f"""
SELECT MIN(cash_connected_graph_change_event.effective_at_millis) AS min_effective_at,
    MIN(cash_connected_graph_change_event.published_at_millis) AS min_published_at,
    MAX(cash_connected_graph_change_event.effective_at_millis) AS max_effective_at,
    MAX(cash_connected_graph_change_event.published_at_millis) AS max_published_at
    from {ccgce_table}
WHERE cash_connected_graph_change_event.event_source_type = 'BAU'
AND cash_connected_graph_change_event.connection_change is not null
"""
date_range_df = spark_query(query)
# Use the max of the earliest effective_at and published_at timestamps as the start time
start_time = pd.to_datetime(max(date_range_df[["min_effective_at", "min_published_at"]].iloc[0]), unit="ms")
# Use the min of the latest effective_at and published_at timestamps as the finish time
finish_time = pd.to_datetime(min(date_range_df[["max_effective_at", "max_published_at"]].iloc[0]), unit="ms")
print(f"Edgy output time range to validate: {start_time} - {finish_time}")

Edgy output time range to validate: 2025-06-26 00:00:02.485000 - 2025-06-26 23:59:59.874000


In [0]:
# Fetch sample of CASH_CUSTOMER edges from duplograph.public.edges that were 
# both created and became effective within the time range of the edgy output we're validating
# (avoiding the boundaries in case of lag issues) 
buffer_timedelta = pd.to_timedelta('3h')
query = f"""
SELECT 
    from_token, 
    to_type, 
    to_token, 
    effective_at,
    created_at 
FROM duplograph.public.edges 
WHERE from_type = 'CASH_CUSTOMER' 
AND to_type != 'CASH_CUSTOMER'
AND effective_at BETWEEN '{start_time + buffer_timedelta}' AND '{finish_time - buffer_timedelta}'
AND created_at BETWEEN '{start_time + buffer_timedelta}' AND '{finish_time - buffer_timedelta}'
LIMIT 1000
"""
bau_edges_df = snowflake_query(query)
bau_edges_df

Unnamed: 0,FROM_TOKEN,TO_TYPE,TO_TOKEN,EFFECTIVE_AT,CREATED_AT
0,C_5qe4chm0g,CASH_OPAQUE_APP_TOKEN,6729512c0ba529fc969797f726d4f90939ff8615,2025-06-26 17:44:33.620,2025-06-26 17:45:04.235
1,C_gs2wc0ysv,TRACKING_COOKIE,f3f4d065-bc15-44e6-ada1-fba98860d880,2025-06-26 20:39:43.572,2025-06-26 20:39:43.735
2,C_f4e3camw8,CASH_DEVICE,0ff7a8fa594c078e,2025-06-26 17:11:43.590,2025-06-26 17:11:44.042
3,C_84eqd0m9z,CASH_OPAQUE_APP_TOKEN,c479da5b17b3f3e87147190a2464876f70955e79,2025-06-26 19:30:21.328,2025-06-26 19:30:21.908
4,C_htezqgmvg,CASH_BACKUP_TAG,aee7ccf2928067169754e849ab55bb7cf8a21b043cc97d...,2025-06-26 16:22:42.306,2025-06-26 16:22:43.042
...,...,...,...,...,...
995,C_sm2vc1ynq,TRACKING_COOKIE,6919f0a9-276b-4e25-8711-6f0cfa7b9b83,2025-06-26 20:00:46.731,2025-06-26 20:00:46.894
996,C_8jemehmh8,CASH_DEVICE,7DA924C4-3C6F-4046-89EE-EB73EA14EA7B,2025-06-26 19:19:14.182,2025-06-26 19:19:14.540
997,C_sjcsepm2q,CASH_DEVICE,d17a7eb2c890c8c6,2025-06-26 19:47:45.852,2025-06-26 19:47:46.431
998,C_s9c4egmjj,CASH_DEVICE,E5DAAE7B-615A-4C9D-ABE2-E1903A91FC9A,2025-06-26 19:56:11.615,2025-06-26 19:56:13.014


In [0]:
# Convert DataFrame rows to list of edge tuples
edges_to_validate = [
    (
        row['FROM_TOKEN'],
        row['TO_TOKEN'],
        row['TO_TYPE'],
        row['EFFECTIVE_AT'].value // 1_000_000 # convert to ms
    )
    for _, row in bau_edges_df.iterrows()
]

# Run validation in batches
output_file = validate_edges_batch(
    edges=edges_to_validate,
    ccgce_table=ccgce_table,
    output_dir="bau_edge_validation",
    batch_size=10, 
    max_workers=10
)

Processing 1000 edges in 100 batches of size 10 using 10 workers
Results will be written to: bau_edge_validation/bau_edge_validation_20250701_030010.csv
Validating 10 edges:
- C_5qe4chm0g -> 6729512c0ba529fc969797f726d4f90939ff8615 (CASH_OPAQUE_APP_TOKEN) @ 1750959873620
- C_gs2wc0ysv -> f3f4d065-bc15-44e6-ada1-fba98860d880 (TRACKING_COOKIE) @ 1750970383572
- C_f4e3camw8 -> 0ff7a8fa594c078e (CASH_DEVICE) @ 1750957903590
- C_84eqd0m9z -> c479da5b17b3f3e87147190a2464876f70955e79 (CASH_OPAQUE_APP_TOKEN) @ 1750966221328
- C_htezqgmvg -> aee7ccf2928067169754e849ab55bb7cf8a21b043cc97d9cf76505c41d4236d0 (CASH_BACKUP_TAG) @ 1750954962306
- C_8heheay1d -> 4589847A-BE04-40C4-B90C-98A73C45FE62 (CASH_DEVICE) @ 1750965165929
- C_gjebe0mpz -> ec4dba62cac69725d873200ac013763b5def5aee (EMAIL) @ 1750956364965
- C_sqc7dpmda -> DEC01EA0-5169-4860-A780-F03AC4B0821E (CASH_DEVICE) @ 1750971290016
- C_8jegc0mw8 -> 8fed0b13e8b397304a3fc3d2f21b663527397ca1 (PHONE_NUMBER) @ 1750966379153
- C_g6c5pqmhk -> 42e4f5




Time to validate C_5xdpdgm95/789be1b1-1577-48d6-aa7a-b00b7341b9e6/TRACKING_COOKIE/1750952993455: 21.20 seconds
ℹ️ C_5xdpdgm95/789be1b1-1577-48d6-aa7a-b00b7341b9e6/TRACKING_COOKIE/1750952993455: No shared assets found


Processing event 3/10: C_n4eyd8ytn/0348fa1e-edb0-42c4-934c-f0062dd1cd3b/TRACKING_COOKIE/1750955568048, effective_at=2025-06-26 16:32:48.048000

Time to validate C_ste82pmzm/6a1d6301-4c4c-44ce-95cf-5ffeb6898de8/TRACKING_COOKIE/1750963272378: 15.19 seconds
ℹ️ C_ste82pmzm/6a1d6301-4c4c-44ce-95cf-5ffeb6898de8/TRACKING_COOKIE/1750963272378: No shared assets found


Processing event 3/10: C_s4c1dgmkq/3d56ce3b028998159e7937f9e87e82c43dc54e37b250278d0fa0589af193934d/CASH_BACKUP_TAG/1750970183703, effective_at=2025-06-26 20:36:23.703000

Time to validate C_80cpd8mmc/e1deffcbbbafba485aef2a53e390ffab46abb7cf613c67172a478dc92092f79b/CASH_BACKUP_TAG/1750957690886: 17.72 seconds
ℹ️ C_80cpd8mmc/e1deffcbbbafba485aef2a53e390ffab46abb7cf613c67172a478dc92092f79b/CASH_BACKUP_TAG/1750957690

In [0]:
# Path to store validation outputs under. Create this beforehand with dbutils.fs.mkdirs(f"dbfs:/Users/...")
s3_dir = "Users/ioanna"

# Copy output to S3 for persistence
output_path = f"{s3_dir}/{output_file}"
dbutils.fs.cp(f"file://{os.getcwd()}/{output_file}", f"dbfs:/{output_path}")

Out[19]: True

In [0]:
# # Cell for single edge validation from bau_edges_df
# edge_to_validate = bau_edges_df.iloc[1]

# result = validate_bau_edge(
#     source_account=edge_to_validate['FROM_TOKEN'],
#     asset_type=edge_to_validate['TO_TYPE'],
#     asset_token=edge_to_validate['TO_TOKEN'],
#     effective_at_ms=edge_to_validate['EFFECTIVE_AT'].value // 1_000_000,
#     ccgce_table=ccgce_table
# )
# print(result.get_summary())

ℹ️ C_0scbd8mnq/c0d7b6dc6cd9ad54/CASH_DEVICE/1749584382370: No account holder found for source account


In [0]:
# # Cell for single edge validation
# result = validate_bau_edge(
#     source_account="C_8ydarhm9s",
#     asset_type="CASH_OPAQUE_APP_TOKEN",
#     asset_token="70fe039699ad0b4c09e44f7432af99128a747d7a",
#     effective_at_ms=1748788039295,
#     ccgce_table=ccgce_table
# )
# print(result.get_summary())

✅ C_8ydarhm9s/70fe039699ad0b4c09e44f7432af99128a747d7a/CASH_OPAQUE_APP_TOKEN/1748788039295: All validations passed


## Validation Result Analysis

In [0]:
# load file containing the validation results
validation_df = pd.read_csv(f"/dbfs/{output_path}")
validation_df

Unnamed: 0,edge_id,source_account,asset_token,asset_type,effective_at_ms,validation_success,validation_status,missing_events,unexpected_events,label_errors,timestamp_errors
0,C_gje3ehmwg/d5165cbff2614c18/CASH_DEVICE/17509...,C_gje3ehmwg,d5165cbff2614c18,CASH_DEVICE,1750953837669,True,No account holder found for source account,,,,
1,C_f9er2am8z/5800A4A5-66AD-41CC-B5FD-B9AD7B5607...,C_f9er2am8z,5800A4A5-66AD-41CC-B5FD-B9AD7B5607CA,CASH_DEVICE,1750955573322,True,No account holder found for source account,,,,
2,C_5qe4chm0g/6729512c0ba529fc969797f726d4f90939...,C_5qe4chm0g,6729512c0ba529fc969797f726d4f90939ff8615,CASH_OPAQUE_APP_TOKEN,1750959873620,True,No account holder found for source account,,,,
3,C_8je4ehmhg/794e280266649f5bbc07f70da4d2f37368...,C_8je4ehmhg,794e280266649f5bbc07f70da4d2f37368ffad69,PHONE_NUMBER,1750960038729,True,No shared assets found,,,,
4,C_nmcvk1mnk/2cb75b3d-ea8b-4406-b22f-226d9376bc...,C_nmcvk1mnk,2cb75b3d-ea8b-4406-b22f-226d9376bc7f,TRACKING_COOKIE,1750962784761,True,No shared assets found,,,,
...,...,...,...,...,...,...,...,...,...,...,...
995,C_sm2vc1ynq/6919f0a9-276b-4e25-8711-6f0cfa7b9b...,C_sm2vc1ynq,6919f0a9-276b-4e25-8711-6f0cfa7b9b83,TRACKING_COOKIE,1750968046731,True,No shared assets found,,,,
996,C_8jemehmh8/7DA924C4-3C6F-4046-89EE-EB73EA14EA...,C_8jemehmh8,7DA924C4-3C6F-4046-89EE-EB73EA14EA7B,CASH_DEVICE,1750965554182,True,No account holder found for source account,,,,
997,C_sjcsepm2q/d17a7eb2c890c8c6/CASH_DEVICE/17509...,C_sjcsepm2q,d17a7eb2c890c8c6,CASH_DEVICE,1750967265852,True,No account holder found for source account,,,,
998,C_s9c4egmjj/E5DAAE7B-615A-4C9D-ABE2-E1903A91FC...,C_s9c4egmjj,E5DAAE7B-615A-4C9D-ABE2-E1903A91FC9A,CASH_DEVICE,1750967771615,True,No account holder found for source account,,,,


In [0]:
# Failed validations
failed_validations_df = validation_df[validation_df['validation_success'] == False]
failed_validations_df

Unnamed: 0,edge_id,source_account,asset_token,asset_type,effective_at_ms,validation_success,validation_status,missing_events,unexpected_events,label_errors,timestamp_errors
316,C_ssccpgypd/696D31BA-B6BC-4136-8C94-C027C2D263...,C_ssccpgypd,696D31BA-B6BC-4136-8C94-C027C2D2631E,CASH_DEVICE,1750963609480,False,No target account holders with active accounts...,,E/AH_5hyhb52n5/AH_dpygpccss/CASH_DEVICE,,


In [0]:
if failed_validations_df.empty:
    print("All validations passed")
else:
    failed_validation_ids = failed_validations_df['edge_id'].unique().tolist()
    num_failed = len(failed_validation_ids)
    all_validated = validation_df['edge_id'].unique().tolist()
    num_validated = len(all_validated)
    print(f"{num_failed} out of {num_validated} edges had failed validations ({num_failed/num_validated*100:.2f}%):\n\n{failed_validation_ids}")

1 out of 1000 edges had failed validations (0.10%):

['C_ssccpgypd/696D31BA-B6BC-4136-8C94-C027C2D2631E/CASH_DEVICE/1750963609480']


### ✅ Missing events

Edges with expected events that were missing in the actual output. These were categorized based on whether the event's `effective_at` timestamp was before, during or after the period of edgy output data that was loaded for validation.

In [0]:
# Prepare DataFrame
missing_events_df = validation_df[validation_df['missing_events'].notna()].copy()

# Convert effective_at_ms to pandas datetime
missing_events_df['effective_at'] = pd.to_datetime(missing_events_df['effective_at_ms'], unit='ms', utc=True)

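# Define the window as the Edgy output time range loaded above (start_time to finish_time),
# localized to UTC so it can be compared with the tz-aware effective_at column
window_start_dt = pd.Timestamp(start_time).tz_localize("UTC")
window_end_dt = pd.Timestamp(finish_time).tz_localize("UTC")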

# Categorize edge timestamps
def categorize(effective_at_dt):
    if effective_at_dt < window_start_dt:
        return "BEFORE"
    elif window_start_dt <= effective_at_dt <= window_end_dt:
        return "DURING"
    else:
        return "AFTER"

missing_events_df["window_category"] = missing_events_df["effective_at"].map(categorize)

# Summary counts
category_counts = missing_events_df['window_category'].value_counts().reindex(["BEFORE","DURING","AFTER"], fill_value=0)

print("--- Missing Event Validation Summary ---")
print(f"Total edges with missing events: {len(missing_events_df)}\n")
print("Count by time window position:")
for cat in ["BEFORE","DURING","AFTER"]:
    print(f"  {cat}: {category_counts[cat]}")

print("\n--- Details by Category ---\n")

for cat in ["BEFORE","DURING","AFTER"]:
    cat_df = missing_events_df[missing_events_df["window_category"] == cat]
    if cat_df.empty:
        continue
    print(f"### {cat} ({len(cat_df)} edges)\n")
    for idx, row in cat_df.iterrows():
        events = row['missing_events'].split('|')
        print(f"- Edge:    {row['edge_id']}")
        print(f"  Time:    {row['effective_at']} (UTC, {row['effective_at_ms']})")
        print(f"  Missing: {len(events)} event(s)")
        for event in events:
            print(f"    - {event}")
        print()

--- Missing Event Validation Summary ---
Total edges with missing events: 0

Count by time window position:
  BEFORE: 0
  DURING: 0
  AFTER: 0

--- Details by Category ---



### ✅ Unexpected events

Edges that had actual events that were not expected based on the validation.
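
Both the missing and unexpected categories come from the same comparison: a pair of set differences between the expected and actual event IDs. The sketch below illustrates this on toy data. It assumes event IDs look like `E/<token>/<token>/<type>` with the two account holder tokens in sorted order; `sketch_event_id` is a hypothetical stand-in for illustration, not the notebook's `connection_change_event_id`.

```python
def sketch_event_id(holder_a: str, holder_b: str, asset_type: str) -> str:
    """Hypothetical order-independent event ID (assumed format, for illustration only)."""
    first, second = sorted([holder_a, holder_b])
    return f"E/{first}/{second}/{asset_type}"


# Toy expected IDs, as they might be computed from the graph state
expected_ids = {
    sketch_event_id("AH_b", "AH_a", "CASH_DEVICE"),
    sketch_event_id("AH_a", "AH_c", "CASH_DEVICE"),
}

# Toy actual IDs, as they might be read from the event table
actual_ids = {
    "E/AH_a/AH_b/CASH_DEVICE",
    "E/AH_a/AH_d/CASH_DEVICE",
}

missing = expected_ids - actual_ids      # expected but never produced
unexpected = actual_ids - expected_ids   # produced but not expected

print("missing:", sorted(missing))
print("unexpected:", sorted(unexpected))
```

An edge passes this check only when both sets are empty.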

In [0]:
# Filter rows where unexpected_events is not empty
unexpected_events_df = validation_df[validation_df['unexpected_events'].notna()]

if unexpected_events_df.empty:
    print("No unexpected events found in the validation results")
else:
    print(f"Found {len(unexpected_events_df)} edges with unexpected events: \n{unexpected_events_df['edge_id'].to_list()}\n")
    
    for _, row in unexpected_events_df.iterrows():
        events = row['unexpected_events'].split('|')
        
        print(f"Edge: {row['edge_id']}")
        print("Unexpected events:")
        for event in events:
            print(f"  - {event}")
        print()

Found 1 edges with unexpected events: 
['C_ssccpgypd/696D31BA-B6BC-4136-8C94-C027C2D2631E/CASH_DEVICE/1750963609480']

Edge: C_ssccpgypd/696D31BA-B6BC-4136-8C94-C027C2D2631E/CASH_DEVICE/1750963609480
Unexpected events:
  - E/AH_5hyhb52n5/AH_dpygpccss/CASH_DEVICE



### ✅ Label mismatches

Mismatches between expected and actual labels on the change event source and target account holders. Excludes `ACCOUNT_DENYLISTED` labels which don't apply to account holders.
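
As a rough illustration of the comparison described above, the sketch below diffs two label collections after dropping `ACCOUNT_DENYLISTED`. The helper `diff_labels` and the labels `LABEL_A` and `LABEL_B` are hypothetical placeholders; the notebook's own `validate_labels` may work differently.

```python
from typing import Iterable, Set, Tuple

# Labels excluded from the comparison because they don't apply to account holders
IGNORED_LABELS = {"ACCOUNT_DENYLISTED"}


def diff_labels(expected: Iterable[str], actual: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (missing, extra) labels after dropping the ignored ones."""
    expected_set = set(expected) - IGNORED_LABELS
    actual_set = set(actual) - IGNORED_LABELS
    return expected_set - actual_set, actual_set - expected_set


missing_labels, extra_labels = diff_labels(
    expected=["LABEL_A", "ACCOUNT_DENYLISTED"],
    actual=["LABEL_A", "LABEL_B"],
)
print("missing:", sorted(missing_labels))
print("extra:", sorted(extra_labels))
```

Comparing as sets keeps the check independent of label order and of repeated labels.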

In [0]:
# Filter rows where label_errors is not empty
label_errors_df = validation_df[validation_df['label_errors'].notna()]

def parse_label_error(error_message: str) -> Optional[dict]:
    """Parse a label error message into its components.
    
    Args:
        error_message: The label error message from validation
        
    Returns:
        Dictionary containing error details or None if parsing fails
    """
    try:
        # Determine if this is source or target label error
        is_source = 'Source label mismatch' in error_message
        is_target = 'Target label mismatch' in error_message
        if not (is_source or is_target):
            return None
            
        # Extract event ID
        event_start = error_message.find('event ') + len('event ')
        event_end = error_message.find(':', event_start)
        event_id = error_message[event_start:event_end].strip()
            
        # Extract target account holder for target label errors
        target_holder = None
        if is_target:
            target_start = error_message.find('Target account holder:') + len('Target account holder:')
            target_end = error_message.find('\n', target_start)
            target_holder = error_message[target_start:target_end].strip()
        
        # Extract expected and actual labels
        expected_start = error_message.find('Expected:') + len('Expected:')
        expected_end = error_message.find('\n', expected_start)
        actual_start = error_message.find('Actual:') + len('Actual:')
        actual_end = error_message.find('\n', actual_start) if '\n' in error_message[actual_start:] else len(error_message)
        
        expected_str = error_message[expected_start:expected_end].strip()
        actual_str = error_message[actual_start:actual_end].strip()
        
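        # Convert string representations of lists to sets. literal_eval only parses
        # Python literals, so it cannot execute code embedded in the CSV contents.
        import ast
        expected_labels = set(ast.literal_eval(expected_str))
        actual_labels = set(ast.literal_eval(actual_str))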
            
        return {
            'type': 'source' if is_source else 'target',
            'event_id': event_id,
            'target_holder': target_holder,
            'expected': sorted(expected_labels),
            'actual': sorted(actual_labels)
        }
    except Exception:
        return None

if label_errors_df.empty:
    print("No label errors found in the validation results")
else:
    print(f"Found {len(label_errors_df)} edges with label errors:\n{label_errors_df['edge_id'].to_list()}\n")
    
    relevant_errors = 0
    for _, row in label_errors_df.iterrows():
        errors = row['label_errors'].split('|')
        
        # Parse and filter errors
        parsed_errors = [
            parse_label_error(error) 
            for error in errors
        ]
        parsed_errors = [e for e in parsed_errors if e is not None]
        
        if parsed_errors:
            relevant_errors += 1
            print(f"Edge: {row['edge_id']}")
            for error in parsed_errors:
                if error['type'] == 'source':
                    print(f"  Source label mismatch (event: {error['event_id']}):")
                    print(f"    Expected: {error['expected']}")
                    print(f"    Actual:   {error['actual']}")
                else:
                    print(f"  Target label mismatch (event: {error['event_id']}, target: {error['target_holder']}):")
                    print(f"    Expected: {error['expected']}")
                    print(f"    Actual:   {error['actual']}")
            print()
    
    print(f"Found {relevant_errors} edges with label errors")
    if relevant_errors < len(label_errors_df):
        print(f"({len(label_errors_df) - relevant_errors} edges had only expected ACCOUNT_DENYLISTED differences)")

No label errors found in the validation results


### ✅ Timestamp errors

Edges with mismatches between expected event timestamps (based on the earliest connection time between source/target via the given asset type) and the actual `effective_at_millis` in the generated change events.
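
The expected timestamp for a source/target pair is the earliest time at which the two became connected through any asset of the given type. The sketch below shows that reduction on toy data in plain Python; the row layout and the helper `earliest_connection_by_target` are assumptions for illustration and do not reproduce the notebook's `validate_timestamps`.

```python
from typing import Dict, Iterable, Tuple

# Toy rows: (target account holder, asset token, connection time in ms)
connections = [
    ("AH_b", "device_1", 1_700_000_300_000),
    ("AH_b", "device_2", 1_700_000_100_000),
    ("AH_c", "device_1", 1_700_000_200_000),
]


def earliest_connection_by_target(rows: Iterable[Tuple[str, str, int]]) -> Dict[str, int]:
    """Keep the minimum connection time per target across all shared assets."""
    earliest: Dict[str, int] = {}
    for target, _asset, ts in rows:
        if target not in earliest or ts < earliest[target]:
            earliest[target] = ts
    return earliest


expected_ts = earliest_connection_by_target(connections)

# Toy effective_at_millis values taken from generated events, keyed by target
actual_effective_at = {"AH_b": 1_700_000_100_000, "AH_c": 1_700_000_250_000}

# Record (expected, actual) for every target whose timestamps disagree
mismatches = {
    target: (expected_ts.get(target), ts)
    for target, ts in actual_effective_at.items()
    if expected_ts.get(target) != ts
}
print(mismatches)
```

Here `AH_b` passes because its event carries the earlier of its two connection times, while `AH_c` is flagged.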

In [0]:
# Filter rows where timestamp_errors is not empty and not NaN
timestamp_errors_df = validation_df[
    validation_df['timestamp_errors'].notna() & (validation_df['timestamp_errors'] != '')
]

if timestamp_errors_df.empty:
    print("No timestamp errors found in the validation results.")
else:
    print(f"Found {len(timestamp_errors_df)} edges with timestamp errors:\n{timestamp_errors_df['edge_id'].to_list()}\n")
    for index, row in timestamp_errors_df.iterrows():
        # Split the pipe-separated error strings
        error_messages = row['timestamp_errors'].split('|')
        
        print(f"Edge: {row['edge_id']}")
        print(f"Number of timestamp errors: {len(error_messages)}")
        print("Timestamp error details:")
        for error_detail in error_messages:
            # Replace literal '\n' sequences from the CSV with actual newlines for readable printing
            readable_error_detail = error_detail.replace('\\n', '\n')
            print(f"  - {readable_error_detail}")
        print("\n") # Add a blank line for separation between edges

No timestamp errors found in the validation results.


## ✅ General output checks

Checks general properties of the output event data set:
- Are there any events that have the same source_user_token and target_user_token?
- Are there duplicate events with the same event_id?
- Are there any events where event_time_millis is different to cash_connected_graph_change_event.effective_at_millis?
- Are there any events where cash_connected_graph_change_event.connection_change.changed_node_type is not in the event_id?
- Are there any events where the source_user_token and target_user_token are not in the event_id?
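
To make the checks concrete, here is a minimal sketch of all five in plain Python over a handful of toy events. The field names follow the event columns used in this notebook; the toy rows and the result variable names are illustrative assumptions, and the actual checks below run on Spark.

```python
from collections import Counter

# Toy events; each of the last four rows violates at least one check
events = [
    {"event_id": "E/AH_a/AH_b/CASH_DEVICE", "source_user_token": "AH_a", "target_user_token": "AH_b",
     "changed_node_type": "CASH_DEVICE", "event_time_millis": 100, "effective_at_millis": 100},
    {"event_id": "E/AH_a/AH_b/CASH_DEVICE", "source_user_token": "AH_b", "target_user_token": "AH_a",
     "changed_node_type": "CASH_DEVICE", "event_time_millis": 100, "effective_at_millis": 100},
    {"event_id": "E/AH_c/AH_c/EMAIL", "source_user_token": "AH_c", "target_user_token": "AH_c",
     "changed_node_type": "EMAIL", "event_time_millis": 7, "effective_at_millis": 7},
    {"event_id": "E/AH_d/AH_e/EMAIL", "source_user_token": "AH_d", "target_user_token": "AH_e",
     "changed_node_type": "PHONE_NUMBER", "event_time_millis": 5, "effective_at_millis": 6},
    {"event_id": "E/AH_f/AH_g/EMAIL", "source_user_token": "AH_f", "target_user_token": "AH_x",
     "changed_node_type": "EMAIL", "event_time_millis": 9, "effective_at_millis": 9},
]


def id_parts(event_id: str):
    """Split 'E/token1/token2/type' into (token1, token2, type)."""
    _, token1, token2, node_type = event_id.split("/")
    return token1, token2, node_type


# 1. Same source and target
self_loops = [e["event_id"] for e in events if e["source_user_token"] == e["target_user_token"]]

# 2. Event IDs that appear more than once
id_counts = Counter(e["event_id"] for e in events)
duplicate_ids = {eid for eid, n in id_counts.items() if n > 1}

# 3. event_time_millis differs from effective_at_millis
time_mismatches = [e["event_id"] for e in events if e["event_time_millis"] != e["effective_at_millis"]]

# 4. changed_node_type does not match the type in the event ID
type_mismatches = [e["event_id"] for e in events if id_parts(e["event_id"])[2] != e["changed_node_type"]]

# 5. Source and target tokens do not match the tokens in the event ID (order-independent)
token_mismatches = [
    e["event_id"] for e in events
    if {e["source_user_token"], e["target_user_token"]} != set(id_parts(e["event_id"])[:2])
]

print(self_loops, duplicate_ids, time_mismatches, type_mismatches, token_mismatches)
```

The token check compares sets rather than positions, so an event whose source and target are swapped relative to its ID still passes.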

In [0]:
from pyspark.sql import DataFrame
from pyspark.sql.functions import udf, col, count as spark_count, lit, when
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql import functions as F

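def parse_event_id(event_id: str) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Parse an event ID of format 'E/token1/token2/type' into its components.
    Returns (token1, token2, type), or (None, None, None) if the ID is malformed."""
    try:
        _, tokens_and_type = event_id.split('E/', 1)
        token1, token2, node_type = tokens_and_type.split('/')
        return (token1, token2, node_type)
    except (ValueError, AttributeError):
        # ValueError: unexpected number of segments; AttributeError: event_id is None
        return (None, None, None)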

# Define the schema for the UDF's return type (a struct)
event_id_schema = StructType([
    StructField("id_token1", StringType(), True),
    StructField("id_token2", StringType(), True),
    StructField("id_node_type", StringType(), True)
])

# Create the Spark UDF
parse_event_id_udf = udf(parse_event_id, event_id_schema)

def format_event(event: dict) -> str:
    """Format a single CashConnectedGraphChangeEvent in a readable way."""
    source_labels_display = tuple(sorted(event.get('source_user_labels', [])))
    target_labels_display = tuple(sorted(event.get('target_user_labels', [])))

    output = [
        f"Event ID: {event['event_id']}",
        f"Source: {event['source_user_token']}",
        f"Target: {event['target_user_token']}",
        f"Node Type: {event['changed_node_type']}",
        f"Source labels: {source_labels_display}",
        f"Target labels: {target_labels_display}",
        f"Event Time: {event['event_time_millis']}",
        f"Effective At: {event['effective_at_millis']}",
        f"Published At: {event['published_at_millis']}"
    ]
    return "\n  ".join(output)

def format_validation_sample(df: pd.DataFrame, validation_type: str, max_samples: int = 3) -> str:
    """Format validation results with samples of problematic events."""
    output = []
    # Number samples by position; the DataFrame index may not start at 0 after filtering
    for sample_num, (_, event) in enumerate(df.head(max_samples).iterrows(), start=1):
        output.append(f"\n{validation_type} {sample_num}:")
        output.append("  " + format_event(event))
    return "\n".join(output)

In [0]:
def analyze_duplicate_event_ids(df):
    """
    Analyze and report on duplicate event_ids in the Spark DataFrame.
    For each duplicated event_id, identify if payloads differ.
    If payloads differ, categorize if they are swapped or non-swapped with different payloads.
    Displays distinct differing payloads (including sorted labels and published_at_millis) for non-swapped cases.
    Reports top occurrences specifically for events with non-swapped differing payloads.
    """

    print("\n\n2. Event IDs that appear more than once:")

    df_prep = df.withColumn(
        "source_labels_s", 
        F.sort_array(F.col("source_user_labels"))
    ).withColumn(
        "target_labels_s", 
        F.sort_array(F.col("target_user_labels"))
    )

    payload_id_fields = [ 
        'effective_at_millis',
        'event_time_millis',
        'source_user_token',
        'target_user_token',
        'source_labels_s', 
        'target_labels_s',
        'source_edge_from_node_token',
        'source_edge_to_node_token',
        'source_edge_to_node_type',
        'source_edge_effective_at',
        'source_edge_created_at',
        'source_edge_updated_at'
    ]

    df_with_payload_struct = df_prep.withColumn(
        "payload_struct", 
        F.struct(*[F.col(c) for c in payload_id_fields])
    )

    all_detail_fields = [ 
        "event_id", "source_user_token", "target_user_token", 
        "source_user_labels", "target_user_labels", 
        "source_labels_s", "target_labels_s",       
        "effective_at_millis", "event_time_millis", 
        "changed_node_type", "published_at_millis",
        "source_edge_from_node_token",
        "source_edge_to_node_token",
        "source_edge_to_node_type",
        "source_edge_effective_at",
        "source_edge_created_at",
        "source_edge_updated_at"
    ]

    dupe_analysis_df = ( 
        df_with_payload_struct.groupBy("event_id")
        .agg(
            F.count("*").alias("total_occurrences"),
            F.collect_set("payload_struct").alias("distinct_payloads"), 
            F.collect_list(F.struct(*[F.col(c) for c in all_detail_fields]))
                .alias("all_instances_for_id") 
        )
        .filter(F.col("total_occurrences") > 1)
        .withColumn("distinct_payload_count", F.size(F.col("distinct_payloads"))) 
        .cache()
    )

    total_dupe_ids = dupe_analysis_df.count() 
    print(f"  Total event_ids with at least one duplicate: {total_dupe_ids}")

    diff_payload_dupes_df = dupe_analysis_df.filter(F.col("distinct_payload_count") > 1).cache() 
    num_diff_payload_dupe_ids = diff_payload_dupes_df.count() 
    
    print(f"  Number of these event_ids that have differing business payloads: {num_diff_payload_dupe_ids}")

    swapped_ids_count = 0 
    non_swapped_diff_ids_count = 0 
    
    non_swapped_examples = [] 
    non_swapped_diff_ids_list = [] 

    if num_diff_payload_dupe_ids > 0:
        diff_payload_rows = diff_payload_dupes_df.collect() 

        for row_data in diff_payload_rows:
            event_id = row_data["event_id"] 
            distinct_payloads = row_data["distinct_payloads"] 
            all_instances_for_id = row_data["all_instances_for_id"]

            is_swapped = False 
            if len(distinct_payloads) == 2: 
                p1 = distinct_payloads[0]
                p2 = distinct_payloads[1]
                if (p1.effective_at_millis == p2.effective_at_millis and
                    p1.event_time_millis == p2.event_time_millis and
                    p1.source_user_token == p2.target_user_token and
                    p1.target_user_token == p2.source_user_token and
                    p1.source_edge_to_node_type == p2.source_edge_to_node_type and
                    p1.source_edge_to_node_token == p2.source_edge_to_node_token and
                    list(p1.source_labels_s or []) == list(p2.target_labels_s or []) and
                    list(p1.target_labels_s or []) == list(p2.source_labels_s or [])):
                    is_swapped = True
            
            if is_swapped:
                swapped_ids_count += 1
            else:
                non_swapped_diff_ids_count += 1
                non_swapped_diff_ids_list.append(event_id)
                if len(non_swapped_examples) < 5: 
                    non_swapped_examples.append(
                        (event_id, all_instances_for_id) 
                    )
        
        print(f"    Number of event_ids with differing payloads that ARE SWAPPED: {swapped_ids_count}")
        print(f"    Number of event_ids with differing payloads that ARE NOT SWAPPED (truly different): {non_swapped_diff_ids_count}")

        if non_swapped_examples:
            print("\n  Sample of NON-SWAPPED duplicates with different payloads (showing all event instances for these event_ids):")
            for i, (event_id, event_instances) in enumerate(non_swapped_examples): 
                print(f"\n  --- Example {i + 1} for NON-SWAPPED differing Event ID: {event_id} ---")
                for variant_idx, event_struct in enumerate(event_instances): 
                    print(f"    Instance {variant_idx + 1}:")
                    print(f"      Source Token:      {event_struct.source_user_token}")
                    print(f"      Target Token:      {event_struct.target_user_token}")
                    print(f"      Effective At:      {event_struct.effective_at_millis} ({pd.to_datetime(event_struct.effective_at_millis, unit='ms')})")
                    print(f"      Published At:      {event_struct.published_at_millis} ({pd.to_datetime(event_struct.published_at_millis, unit='ms')})")
                    print(f"      Source Labels (sorted): {list(event_struct.source_labels_s or [])}") 
                    print(f"      Target Labels (sorted): {list(event_struct.target_labels_s or [])}") 
                    print(f"      Source edge: {event_struct.source_edge_from_node_token} -> {event_struct.source_edge_to_node_type}:{event_struct.source_edge_to_node_token} @ {event_struct.source_edge_effective_at}, created: {event_struct.source_edge_created_at}, updated: {event_struct.source_edge_updated_at}") 
            print("\n")
    else:
        print("  All event_ids with duplicates have identical business payloads.")

    # if non_swapped_diff_ids_list:
    #     print("\n  Top 10 event_ids with NON-SWAPPED differing payloads (by total occurrences):")
    #     (dupe_analysis_df
    #         .filter(F.col("event_id").isin(non_swapped_diff_ids_list))
    #         .select("event_id", "total_occurrences", "distinct_payload_count")
    #         .orderBy(F.col("total_occurrences").desc())
    #         .show(10, truncate=False))
    # else:
    #     print("\n  No event_ids found with NON-SWAPPED differing payloads to show top occurrences for.")


    dupe_analysis_df.unpersist()
    diff_payload_dupes_df.unpersist()
    
    return total_dupe_ids, num_diff_payload_dupe_ids, swapped_ids_count, non_swapped_diff_ids_count
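
The swapped-payload test inside `analyze_duplicate_event_ids` can be pulled out into a plain function so it can be exercised without Spark. The sketch below is illustrative only: `Payload` is a hypothetical stand-in for the Spark `Row` stored in `distinct_payloads`, `is_swapped_pair` is a hypothetical helper name, and treating missing labels as an empty list is an assumption rather than confirmed Edgy behaviour.

```python
from typing import NamedTuple, Optional, Tuple


class Payload(NamedTuple):
    """Hypothetical stand-in for one Row in `distinct_payloads`."""
    effective_at_millis: int
    event_time_millis: int
    source_user_token: str
    target_user_token: str
    source_labels_s: Optional[Tuple[str, ...]]
    target_labels_s: Optional[Tuple[str, ...]]
    source_edge_to_node_type: str
    source_edge_to_node_token: str


def is_swapped_pair(p1: Payload, p2: Payload) -> bool:
    """Return True when p2 equals p1 with the source and target sides exchanged."""
    return (
        p1.effective_at_millis == p2.effective_at_millis
        and p1.event_time_millis == p2.event_time_millis
        and p1.source_user_token == p2.target_user_token
        and p1.target_user_token == p2.source_user_token
        and p1.source_edge_to_node_type == p2.source_edge_to_node_type
        and p1.source_edge_to_node_token == p2.source_edge_to_node_token
        # Assumption: missing (None) labels are treated as an empty list.
        and list(p1.source_labels_s or []) == list(p2.target_labels_s or [])
        and list(p1.target_labels_s or []) == list(p2.source_labels_s or [])
    )


# Made-up tokens, used only to exercise the helper.
a = Payload(1000, 1000, "AH_a", "AH_b", ("x",), (), "CASH_DEVICE", "dev-1")
b = a._replace(
    source_user_token="AH_b",
    target_user_token="AH_a",
    source_labels_s=(),
    target_labels_s=("x",),
)
c = a._replace(effective_at_millis=999)

print(is_swapped_pair(a, b))  # mirrored pair
print(is_swapped_pair(a, c))  # differing effective_at_millis
```

Keeping the comparison in one function makes it easy to confirm that a payload never counts as a swap of itself unless its source and target tokens are identical.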

In [0]:
def validate_bau_connection_change_events(ccgce_table: str, spark_session):
    """Validate general properties of BAU ConnectionChange events.
    
    Args:
        ccgce_table: Full name of the CashConnectedGraphChangeEvent table to validate
        spark_session: The active SparkSession.
    """
    query = f"""
    SELECT 
        event_time_millis,
        cash_connected_graph_change_event.event_id as event_id,
        cash_connected_graph_change_event.source_user_token as source_user_token,
        cash_connected_graph_change_event.target_user_token as target_user_token,
        cash_connected_graph_change_event.effective_at_millis as effective_at_millis,
        cash_connected_graph_change_event.published_at_millis as published_at_millis,
        cash_connected_graph_change_event.connection_change.changed_node_type as changed_node_type,
        cash_connected_graph_change_event.connection_change.source_user_labels as source_user_labels,
        cash_connected_graph_change_event.connection_change.target_user_labels as target_user_labels,
        cash_connected_graph_change_event.connection_change.source_edge.from_node.token as source_edge_from_node_token,
        cash_connected_graph_change_event.connection_change.source_edge.to_node.type as source_edge_to_node_type,
        cash_connected_graph_change_event.connection_change.source_edge.to_node.token as source_edge_to_node_token,
        cash_connected_graph_change_event.connection_change.source_edge.effective_at_msec as source_edge_effective_at,
        cash_connected_graph_change_event.connection_change.source_edge.created_at as source_edge_created_at,
        cash_connected_graph_change_event.connection_change.source_edge.updated_at as source_edge_updated_at
    FROM {ccgce_table}
    WHERE cash_connected_graph_change_event.event_source_type = 'BAU'
    AND cash_connected_graph_change_event.connection_change is not null
    """
    
    spark_df = spark_session.sql(query)
    spark_df.cache() 

    total_events = spark_df.count()
    print(f"Total events: {total_events}\n")

    if total_events == 0:
        spark_df.unpersist()
        return
    
    spark_df_parsed = spark_df.withColumn("id_parts", parse_event_id_udf(col("event_id")))
    spark_df_parsed.cache() # Cache the parsed version as well

    # 1. Check for self-referential events
    self_ref_df = spark_df_parsed.filter(col("source_user_token") == col("target_user_token"))
    self_ref_count = self_ref_df.count()
    print("\n1. Events with same source and target user:")
    print(f"Total count: {self_ref_count}")
    if self_ref_count > 0:
        print("\nSample of such events:")
        print(format_validation_sample(self_ref_df.limit(3).toPandas(), "Self-referential event"))

    # 2. Check for duplicate event IDs
    dup_id_count, dup_mismatched_payload_count, swapped_count, non_swapped_count = analyze_duplicate_event_ids(spark_df_parsed)

    # 3. Check for mismatched timestamps
    timestamp_mismatch_df = spark_df_parsed.filter(col("event_time_millis") != col("effective_at_millis"))
    timestamp_mismatch_count = timestamp_mismatch_df.count()
    print("\n\n3. Events with event_time_millis != effective_at_millis:")
    print(f"Total count: {timestamp_mismatch_count}")
    if timestamp_mismatch_count > 0:
        print("\nSample of such events:")
        print(format_validation_sample(timestamp_mismatch_df.limit(3).toPandas(), "Mismatched root timestamp"))

    # 4. Check for node type mismatches
    node_type_mismatch_df = spark_df_parsed.filter(col("id_parts.id_node_type") != col("changed_node_type"))
    node_type_mismatch_count = node_type_mismatch_df.count()
    print("\n\n4. Events where changed_node_type (payload) != id_node_type (from event_id):") 
    print(f"Total count: {node_type_mismatch_count}")
    if node_type_mismatch_count > 0:
        print("\nSample of such events:") 
        print(format_validation_sample(node_type_mismatch_df.limit(3).toPandas(), "Node type mismatch (payload vs ID)"))

    # 5. Check for user token mismatches
    token_mismatch_df = spark_df_parsed.filter(
        ((col("id_parts.id_token1") != col("target_user_token")) & (col("id_parts.id_token1") != col("source_user_token"))) |
        ((col("id_parts.id_token2") != col("target_user_token")) & (col("id_parts.id_token2") != col("source_user_token")))
    )
    token_mismatch_count = token_mismatch_df.count()
    print("\n\n5. Events where source/target tokens (payload) don't match event_id structure:") 
    print(f"Total count: {token_mismatch_count}")
    if token_mismatch_count > 0:
        print("\nSample of such events:") 
        print(format_validation_sample(token_mismatch_df.limit(3).toPandas(), "Token mismatch (payload vs ID)"))

    # Print summary
    print("\n\nValidation Summary:")
    print(f"Total events validated: {total_events}")
    print(f"Self-referential events: {self_ref_count}")
    print(f"Event IDs with duplicates: {dup_id_count}. Differing payloads: {dup_mismatched_payload_count} (Swapped: {swapped_count}, Non-Swapped Diff: {non_swapped_count})")
    #print(f"Event IDs with duplicates: {dup_id_count}. With mismatched payloads: {dup_mismatched_payload_count}")
    print(f"Events with timestamp mismatches: {timestamp_mismatch_count}")
    print(f"Events with node type mismatches: {node_type_mismatch_count}")
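    print(f"Events with token mismatches: {token_mismatch_count}")

    # Release the cached DataFrames now that all checks have run
    spark_df_parsed.unpersist()
    spark_df.unpersist()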

In [0]:
validate_bau_connection_change_events(ccgce_table, spark)

Total events: 5006462


1. Events with same source and target user:
Total count: 0


2. Event IDs that appear more than once:
  Total event_ids with at least one duplicate: 145799
  Number of these event_ids that have differing business payloads: 145797
    Number of event_ids with differing payloads that ARE SWAPPED: 4
    Number of event_ids with differing payloads that ARE NOT SWAPPED (truly different): 145793

  Sample of NON-SWAPPED duplicates with different payloads (showing all event instances for these event_ids):

  --- Example 1 for NON-SWAPPED differing Event ID: E/AH_01mge3c0g/AH_21ypqp2ag/CASH_DEVICE ---
    Instance 1:
      Source Token:      AH_01mge3c0g
      Target Token:      AH_21ypqp2ag
      Effective At:      1750910160394 (2025-06-26 03:56:00.394000)
      Published At:      1750910162867 (2025-06-26 03:56:02.867000)
      Source Labels (sorted): []
      Target Labels (sorted): []
      Source edge: C_g0c3egm10 -> CASH_DEVICE:F03729DD-4A79-47E5-9054-19BCC04F042

### Duplicate event_ids in output

Duplicate `event_id`s were observed in a small fraction (~3%) of Edgy's ConnectionChange output events. Spot checks of a few examples suggest that many of these arise when Duplograph publishes an update to an existing edge, carrying an earlier `effective_at` timestamp, within seconds of the original edge being created. Race conditions are likely in such scenarios, but Edgy's behaviour appears correct overall: it publishes a ConnectionChange event with the earliest `effective_at` time it can find. Other duplicates come from account holders who are connected by two different assets of the same type at exactly the same time; in this scenario Edgy publishes a ConnectionChange event for each connection by design.
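
To take the spot checks beyond a handful of examples, the two explanations above could be turned into a rough classifier over the instances collected for each duplicated `event_id`. The following is a minimal sketch, not part of the validation itself: `classify_duplicate_group` and its category names are hypothetical, the rules are heuristics inferred from the description above, and the sample tokens are made up. Field names follow the column aliases used in the validation query.

```python
from collections import Counter
from typing import Dict, List


def classify_duplicate_group(instances: List[dict]) -> str:
    """Heuristically label the likely cause of one duplicated event_id."""
    edges = {
        (
            i["source_edge_from_node_token"],
            i["source_edge_to_node_type"],
            i["source_edge_to_node_token"],
        )
        for i in instances
    }
    assets = {i["source_edge_to_node_token"] for i in instances}
    effective_times = {i["effective_at_millis"] for i in instances}

    # Two different assets of the same type connected at the same instant.
    if len(assets) > 1 and len(effective_times) == 1:
        return "multiple_assets_same_time"
    # The same edge re-published with a different effective_at timestamp.
    if len(edges) == 1 and len(effective_times) > 1:
        return "same_edge_effective_at_changed"
    return "other"


# Hypothetical sample data keyed by event_id.
groups: Dict[str, List[dict]] = {
    "E/AH_1/AH_2/CASH_DEVICE": [
        {"source_edge_from_node_token": "C_1", "source_edge_to_node_type": "CASH_DEVICE",
         "source_edge_to_node_token": "dev-1", "effective_at_millis": 2000},
        {"source_edge_from_node_token": "C_1", "source_edge_to_node_type": "CASH_DEVICE",
         "source_edge_to_node_token": "dev-1", "effective_at_millis": 1500},
    ],
    "E/AH_3/AH_4/CASH_DEVICE": [
        {"source_edge_from_node_token": "C_3", "source_edge_to_node_type": "CASH_DEVICE",
         "source_edge_to_node_token": "dev-2", "effective_at_millis": 3000},
        {"source_edge_from_node_token": "C_3", "source_edge_to_node_type": "CASH_DEVICE",
         "source_edge_to_node_token": "dev-3", "effective_at_millis": 3000},
    ],
}

cause_counts = Counter(classify_duplicate_group(v) for v in groups.values())
print(dict(cause_counts))
```

A large "other" bucket on real data would indicate duplicates that neither explanation covers and that warrant a closer look.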