# Edgy BAU Label Validation

This notebook validates the CashConnectedGraphChangeEvents that the Edgy service generates when it processes BAU (live) LabelEvents. It compares the generated events against the expected events computed from Duplograph's graph state.

## Overview

The validation process:
1. Takes a list of label events as input containing:
   - Source account (CASH_CUSTOMER) token
   - Label type
   - Whether the label is present (true or false)
   - Label effective timestamp
2. For each label event:
   - Finds the source account holder
   - Queries their active (non-merged) accounts
   - Queries their active labels before and after the current label event
   - Expects no output events if the label event does not change the label state
   - Finds assets connected to the source account holder's accounts
   - Identifies other account holders connected to the same assets via active (non-merged) accounts
   - Computes expected LabelChange CashConnectedGraphChangeEvents based on shared asset types
   - Validates generated event IDs against expected events to ensure all expected events are present, and no unexpected events are generated
   - Validates the connections between source and target in LabelChange.connection_node_types 
   - Validates the timestamp of the events matches the timestamp of the label event
3. Performs general sanity checks on the generated events
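
The input fields listed in step 1 can be pictured as a small record. The sketch below is illustrative only: `LabelEventInput` is a hypothetical name that the notebook does not define, and the identifier format is assumed from the `<token>/<label>/<present>/<millis>` strings that appear in the validation log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelEventInput:
    """Hypothetical container for one input label event (illustrative name)."""
    node_token: str       # source CASH_CUSTOMER token
    label_type: str       # e.g. "BLACKLISTED"
    label_present: bool   # whether the label is present
    effective_at_ms: int  # label effective timestamp, epoch milliseconds

    @property
    def label_event_id(self) -> str:
        # Assumed shape: "<token>/<label>/<present>/<millis>"
        return (
            f"{self.node_token}/{self.label_type}/"
            f"{self.label_present}/{self.effective_at_ms}"
        )


event = LabelEventInput("C_g0d4aqmrs", "BLACKLISTED", True, 1750937263201)
print(event.label_event_id)
```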

## Setup

First, install required dependencies and import necessary modules.

In [0]:
%pip install --extra-index-url https://artifactory.global.square/artifactory/api/pypi/block-pypi/simple sq-pysnowflake==1.6.0

Python interpreter will be restarted.
Time to validate event C_5jefchmj8/BLACKLISTED/True/1750946683067: 23.08 seconds
ℹ️ Label Event C_5jefchmj8/BLACKLISTED/True/1750946683067: Source AH AH_8jmhcfej5 has no connected target AHs. No events expected.

Processing input event 3/10: C_g0d4aqmrs/BLACKLISTED/True/1750937263201, effective_at=2025-06-26 11:27:43.201000
Looking in indexes: https://pypi.org/simple, https://artifactory.global.square/artifactory/api/pypi/block-pypi/simple
Time to validate event C_s6eaaams6/BLACKLISTED/True/1750945651685: 29.98 seconds
ℹ️ Label Event C_s6eaaams6/BLACKLISTED/True/1750945651685: Source AH AH_6smaaae6s has no *active* connected target AHs. No events expected.

Validation complete: 10/10 input label events passed.

Timing statistics:
- Average time per input label event: 39.56 seconds
- Fastest validation: 13.93 seconds
- Slowest validation: 75.90 seconds
- Total validation time for this batch: 395.62 seconds

Results written to: bau_label_validation/b

In [0]:
import pandas as pd
import numpy as np
import csv
import fcntl
import json
import os
import time
import math
from datetime import datetime, timedelta
from typing import List, Set, Dict, Tuple, Optional
from concurrent.futures import ThreadPoolExecutor
import concurrent.futures
from pyspark.sql.functions import col
from pysnowflake import Session
from IPython.display import display, HTML

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)



## Helper Functions

### Query helpers

In [0]:
# Define a consistent datetime format for all Snowflake SQL queries
SNOWFLAKE_DT_FORMAT = '%Y-%m-%d %H:%M:%S.%f'

warehouse_2xl = 'etl__2xlarge'
warehouse_large = 'etl__large'
warehouse_medium = 'etl__medium'

def snowflake_query(query: str) -> pd.DataFrame:
    """Execute a query in Snowflake and return results as a pandas DataFrame."""
    with Session(query_tag="ioanna", connection_override_args={'warehouse': warehouse_large}) as sess:
        cursor = sess.execute(query)
        return cursor.fetch_pandas_all()

def spark_query(query: str) -> pd.DataFrame:
    """Execute a query in Spark and return results as a pandas DataFrame."""
    return spark.sql(query).toPandas()
        
def to_sql_list(inputs):
    """Convert a set/list of values to SQL IN clause format"""
    if not inputs:
        return "()"
    return "('" + "','".join(str(x) for x in inputs) + "')"
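
A quick usage sketch for `to_sql_list` (repeated here so the example runs on its own). The second function, `to_sql_list_escaped`, is a hypothetical variant that the notebook does not use. It assumes you want stable ordering, doubled single quotes, and `(NULL)` for empty input, because an empty `IN ()` is rejected by most SQL dialects. Note that `NOT IN (NULL)` never evaluates to true, so callers would still need their own empty-input guard.

```python
def to_sql_list(inputs):
    """Same helper as in the notebook, repeated so this example is self-contained."""
    if not inputs:
        return "()"
    return "('" + "','".join(str(x) for x in inputs) + "')"


def to_sql_list_escaped(inputs):
    """Hypothetical variant: sorted output, single quotes doubled, (NULL) when empty."""
    if not inputs:
        return "(NULL)"  # assumption: matches nothing instead of emitting `IN ()`
    quoted = ("'" + str(x).replace("'", "''") + "'" for x in sorted(inputs, key=str))
    return "(" + ",".join(quoted) + ")"


print(to_sql_list(["C_a", "C_b"]))          # ('C_a','C_b')
print(to_sql_list([]))                      # ()
print(to_sql_list_escaped({"C_b", "C_a"}))  # ('C_a','C_b')
```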

### Validation Result Class
Tracks validation outcomes and error details for each input label event.

In [0]:
class ValidationResult:
    def __init__(self, label_event_id: str): 
        self.label_event_id = label_event_id 
        self.validation_status: Optional[str] = None
        self.missing_events: Set[str] = set()
        self.unexpected_events: Set[str] = set()
        self.timestamp_errors: List[str] = []
        self.connection_type_errors: List[str] = []
        self.payload_errors: List[str] = []
        self.success: bool = True
        
    def add_error(self, error_type: str, details: str):
        """Records an error of the specified type with details"""
        self.success = False
        if error_type == "validation_status":
            self.validation_status = details
        elif error_type == "missing_event":
            self.missing_events.add(details)
        elif error_type == "unexpected_event":
            self.unexpected_events.add(details)
        elif error_type == "timestamp":
            self.timestamp_errors.append(details)
        elif error_type == "connection_type":
            self.connection_type_errors.append(details)
        elif error_type == "payload_error":
            self.payload_errors.append(details)
            
    def get_summary(self) -> str:
        """Generates a human-readable summary of validation results"""
        summary = []
        
        identifier = self.label_event_id

        if self.validation_status:
            summary.append(f"ℹ️ Label Event {identifier}: {self.validation_status}")
        elif self.success:
            summary.append(f"✅ Label Event {identifier}: All validations passed")
        else:
            summary.append(f"❌ Label Event {identifier}: Validation failed")
        
        if self.missing_events:
            summary.append(f"\nMissing events ({len(self.missing_events)}):")
            for event in sorted(self.missing_events):
                summary.append(f"  - {event}")
        if self.unexpected_events:
            summary.append(f"\nUnexpected events ({len(self.unexpected_events)}):")
            for event in sorted(self.unexpected_events):
                summary.append(f"  - {event}")
        if self.timestamp_errors:
            summary.append(f"\nTimestamp errors ({len(self.timestamp_errors)}):")
            for error in self.timestamp_errors:
                summary.append(f"  - {error}")
        if self.connection_type_errors:
            summary.append(f"\nConnection Type errors ({len(self.connection_type_errors)}):")
            for error in self.connection_type_errors:
                summary.append(f"  - {error}")
        if self.payload_errors:
            summary.append(f"\nPayload errors ({len(self.payload_errors)}):")
            for error in self.payload_errors:
                summary.append(f"  - {error}")
        return "\n".join(summary)

    def write_to_csv(self, csv_path: str):
        """Writes this validation result to CSV in a thread-safe manner.
        
        Args:
            csv_path: Path to the CSV file
        """
        row_data = {
            'label_event_id': self.label_event_id,
            'validation_success': self.success,
            'validation_status': self.validation_status if self.validation_status else \
                               ('PASS' if self.success else 'FAIL'),
            'missing_events': '|'.join(sorted(self.missing_events)) if self.missing_events else '',
            'unexpected_events': '|'.join(sorted(self.unexpected_events)) if self.unexpected_events else '',
            'timestamp_errors': '|'.join(self.timestamp_errors) if self.timestamp_errors else '',
            'connection_type_errors': '|'.join(self.connection_type_errors) if self.connection_type_errors else '',
            'payload_errors': '|'.join(self.payload_errors) if self.payload_errors else ''
        }
        
        fieldnames = [
            'label_event_id',
            'validation_success',
            'validation_status',
            'missing_events',
            'unexpected_events',
            'timestamp_errors',
            'connection_type_errors',
            'payload_errors'
        ]
        
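        # Always open in append mode: 'w' would truncate rows written by a
        # concurrent writer between an existence check and the open.
        with open(csv_path, 'a', newline='') as f:
            fcntl.flock(f.fileno(), fcntl.LOCK_EX)
            try:
                writer = csv.DictWriter(f, fieldnames=fieldnames)
                
                # Check for an empty file only after taking the lock, so that
                # exactly one writer emits the header.
                if os.fstat(f.fileno()).st_size == 0:
                    writer.writeheader()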
                
                writer.writerow(row_data)
                
                f.flush()
                os.fsync(f.fileno())
            finally:
                fcntl.flock(f.fileno(), fcntl.LOCK_UN)
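
Because list-valued fields are stored pipe-delimited, reading the results file back requires splitting them again. The following is a minimal sketch of that step. `parse_result_row` is a hypothetical helper, and the sample row uses made-up tokens rather than real output.

```python
import csv
import io

LIST_COLUMNS = [
    "missing_events", "unexpected_events", "timestamp_errors",
    "connection_type_errors", "payload_errors",
]


def parse_result_row(row):
    """Hypothetical helper: restore the bool and list fields of one CSV row."""
    parsed = dict(row)
    # csv.DictWriter stores the bool as its str() form, "True" or "False".
    parsed["validation_success"] = row["validation_success"] == "True"
    for column in LIST_COLUMNS:
        parsed[column] = row[column].split("|") if row[column] else []
    return parsed


# Illustrative file content with made-up tokens.
sample = io.StringIO(
    "label_event_id,validation_success,validation_status,missing_events,"
    "unexpected_events,timestamp_errors,connection_type_errors,payload_errors\r\n"
    "C_x/LABEL_A/True/1,False,FAIL,"
    "L/AH_t1/AH_s/LABEL_A/1/+|L/AH_t2/AH_s/LABEL_A/1/+,,,,\r\n"
)
rows = [parse_result_row(r) for r in csv.DictReader(sample)]
print(rows[0]["missing_events"])
```

This only round-trips cleanly as long as no error message contains a `|` character.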

### Event ID Generation
Replicates Edgy's LabelChange event ID generation logic for consistency checking.

In [0]:
def label_event_type_marker(event_type_str: str) -> str:
    """Converts a label event type string to its ID marker."""
    if event_type_str == "LABEL_ADDED":
        return "+"
    elif event_type_str == "LABEL_REMOVED":
        return "-"
    else:
        raise ValueError(f"Unsupported label event type for ID generation: {event_type_str}")

def label_change_event_id(
    target_token: str, 
    source_token: str, 
    label_type_name: str, # e.g., "SUSPICIOUS_ACTIVITY"
    effective_at_millis: int, 
    event_type_str: str # e.g., "LABEL_ADDED" or "LABEL_REMOVED"
) -> str:
    """
    Produces a LabelChange event_id in the same format as edgy.
    Format: L/<targetUserToken>/<sourceUserToken>/<labelTypeName>/<effectiveAtMillis>/{+,-}
    Note: Tokens are NOT alphabetically sorted for LabelChange event IDs.
    """
    marker = label_event_type_marker(event_type_str)
    return f"L/{target_token}/{source_token}/{label_type_name}/{effective_at_millis}/{marker}"
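
A worked example of the ID format, using made-up tokens. The generator is restated compactly so the block runs on its own. `parse_label_change_event_id` is a hypothetical inverse added for illustration, and it assumes that no component contains a `/`.

```python
def label_change_event_id(target_token, source_token, label_type_name,
                          effective_at_millis, event_type_str):
    """Compact restatement of the notebook's ID format."""
    markers = {"LABEL_ADDED": "+", "LABEL_REMOVED": "-"}
    if event_type_str not in markers:
        raise ValueError(f"Unsupported label event type: {event_type_str}")
    return (f"L/{target_token}/{source_token}/{label_type_name}/"
            f"{effective_at_millis}/{markers[event_type_str]}")


def parse_label_change_event_id(event_id):
    """Hypothetical inverse; assumes no component contains '/'."""
    prefix, target, source, label, millis, marker = event_id.split("/")
    if prefix != "L" or marker not in ("+", "-"):
        raise ValueError(f"Not a LabelChange event ID: {event_id}")
    return {
        "target_token": target,
        "source_token": source,
        "label_type_name": label,
        "effective_at_millis": int(millis),
        "event_type": "LABEL_ADDED" if marker == "+" else "LABEL_REMOVED",
    }


# Illustrative tokens, not real account holders.
added = label_change_event_id(
    "AH_target", "AH_source", "SUSPICIOUS_ACTIVITY", 1750937263201, "LABEL_ADDED"
)
swapped = label_change_event_id(
    "AH_source", "AH_target", "SUSPICIOUS_ACTIVITY", 1750937263201, "LABEL_ADDED"
)
print(added)             # L/AH_target/AH_source/SUSPICIOUS_ACTIVITY/1750937263201/+
print(added == swapped)  # False: the target always comes first, tokens are not sorted
```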


### Timestamp validation

Validation of change event timestamps

In [0]:
def validate_timestamps(
    actual_events_df: pd.DataFrame, 
    expected_effective_at_millis: int, 
    validation_result: ValidationResult
) -> None:
    """
    Validates that all actual LabelChange events have the expected effective_at_millis.
    Assumes actual_events_df is not empty and contains required columns if called.

    Args:
        actual_events_df: DataFrame of actual LabelChange CashConnectedGraphChangeEvents.
                          Expected columns: ['event_id', 'effective_at_millis'].
        expected_effective_at_millis: The expected effective_at_millis for these events,
                                      derived from the input label event.
        validation_result: ValidationResult object to store any errors.
    """

    # Find rows with mismatched timestamps
    mismatched_df = actual_events_df[
        actual_events_df['effective_at_millis'] != expected_effective_at_millis
    ]

    if not mismatched_df.empty:
        expected_dt_str = pd.to_datetime(expected_effective_at_millis, unit='ms').strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]
        
        for _, event_row in mismatched_df.iterrows():
            actual_event_timestamp_millis = event_row['effective_at_millis']
            actual_dt_str = pd.to_datetime(actual_event_timestamp_millis, unit='ms').strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]
            
            validation_result.add_error(
                "timestamp",
                f"Timestamp mismatch for event {event_row['event_id']}:\n"
                f"  Expected (from input label event): {expected_dt_str} ({expected_effective_at_millis} ms)\n"
                f"  Actual (in generated event): {actual_dt_str} ({actual_event_timestamp_millis} ms)"
            )
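
The same check can be sketched without pandas, which may help when reasoning about the error messages. `millis_to_str` and `find_timestamp_mismatches` are hypothetical names, and the event IDs are made up. The timestamp is one that appears in the validation log.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def millis_to_str(millis: int) -> str:
    """Format epoch milliseconds as 'YYYY-MM-DD HH:MM:SS.mmm' (naive UTC)."""
    dt = EPOCH + timedelta(milliseconds=millis)  # integer arithmetic, no float rounding
    return dt.strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]  # trim microseconds to millis


def find_timestamp_mismatches(events, expected_millis):
    """Return (event_id, actual_millis) for events whose timestamp differs."""
    return [
        (e["event_id"], e["effective_at_millis"])
        for e in events
        if e["effective_at_millis"] != expected_millis
    ]


# Made-up events: the second one is off by 798 ms.
events = [
    {"event_id": "L/AH_t1/AH_s/BLACKLISTED/1750937263201/+",
     "effective_at_millis": 1750937263201},
    {"event_id": "L/AH_t2/AH_s/BLACKLISTED/1750937263201/+",
     "effective_at_millis": 1750937263999},
]
print(millis_to_str(1750937263201))  # 2025-06-26 11:27:43.201
print(find_timestamp_mismatches(events, 1750937263201))
```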

### Label Retrieval and Impact Assessment
Functions to retrieve active labels and determine the impact of a label event on the account holder's active labels.

In [0]:
def get_account_holder_labels(
    account_holder_accounts: Dict[str, Set[str]],
    effective_at: str
) -> pd.DataFrame:
    """Gets active labels for account holders via their accounts.
    
    Args:
        account_holder_accounts: Dict mapping account holder tokens to their active account tokens
        effective_at: Timestamp for point-in-time query
        
    Returns:
        DataFrame with label information
    """
    # Flatten account tokens for the query
    all_accounts = {
        account 
        for accounts in account_holder_accounts.values() 
        for account in accounts
    }
    
    if not all_accounts:
        return pd.DataFrame(columns=[
            'ACCOUNT_HOLDER_TOKEN', 'ACCOUNT_TOKEN', 'LABEL',
            'LABEL_EFFECTIVE_AT', 'ACCOUNT_MAPPING_EFFECTIVE_AT'
        ])

    query = f"""
    WITH latest_label_events AS (
        SELECT node_token,
               label,
               present,
               effective_at,
               ROW_NUMBER() OVER (PARTITION BY node_token, label ORDER BY effective_at DESC) as rn
        FROM duplograph.public.label_events
        WHERE node_type = 'CASH_CUSTOMER'
        AND effective_at <= '{effective_at}'
        AND node_token IN {to_sql_list(all_accounts)}
    )
    SELECT 
        m.from_token AS ACCOUNT_HOLDER_TOKEN,
        m.to_token AS ACCOUNT_TOKEN,
        l.label AS LABEL,
        l.effective_at AS LABEL_EFFECTIVE_AT,
        m.effective_at AS ACCOUNT_MAPPING_EFFECTIVE_AT
    FROM latest_label_events l
    JOIN duplograph.public.edges m 
        ON l.node_token = m.to_token
    WHERE l.rn = 1 
    AND l.present = true
    AND m.from_type = 'ACCOUNT_HOLDER_CASH_CUSTOMER'
    AND m.to_type = 'CASH_CUSTOMER'
    AND m.effective_at <= '{effective_at}'
    AND m.from_token IN {to_sql_list(account_holder_accounts.keys())}
    ORDER BY 
        m.from_token,
        m.to_token,
        l.label
    """
    
    return snowflake_query(query)
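
The `ROW_NUMBER() ... rn = 1` pattern in the query keeps only the latest event per `(node_token, label)` pair and then drops pairs whose latest event has `present = false`. A rough in-memory equivalent, assuming events arrive as plain tuples with integer timestamps (`active_labels_at` is a hypothetical name):

```python
def active_labels_at(label_events, effective_at):
    """Return the (node_token, label) pairs that are active at `effective_at`.

    label_events: iterable of (node_token, label, present, event_time) tuples.
    """
    latest = {}
    for node_token, label, present, event_time in label_events:
        if event_time > effective_at:
            continue  # point-in-time filter: ignore events after the cutoff
        key = (node_token, label)
        # Keep the most recent event per key. On exact ties the first one
        # seen wins here, whereas the SQL ordering is unspecified for ties.
        if key not in latest or event_time > latest[key][0]:
            latest[key] = (event_time, present)
    return {key for key, (_, present) in latest.items() if present}


# Made-up events: LABEL_A is added at t=100 and removed at t=200.
events = [
    ("C_1", "LABEL_A", True, 100),
    ("C_1", "LABEL_A", False, 200),
    ("C_2", "LABEL_B", True, 150),
]
print(active_labels_at(events, 150))
print(active_labels_at(events, 200))
```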

In [0]:
def get_label_event_impact(
    source_ah_token: str,
    source_ah_accounts: Set[str],
    label_type: str,
    effective_at_str: str # Expects formatted string, matching SNOWFLAKE_DT_FORMAT
) -> str:
    """
    Determines if an input label event results in an effective label addition, removal,
    or no net change to the source account holder's aggregated active labels for that specific label type.

    It compares the set of active labels for the account holder *before* and *at* the
    provided effective_at_str.

    Args:
        source_ah_token: The token of the source account holder.
        source_ah_accounts: A set of active account tokens for the source account holder.
        label_type: The label type from the input label event (e.g., "SUSPICIOUS_ACTIVITY").
        effective_at_str: The effective_at timestamp as a string, formatted according
                          to SNOWFLAKE_DT_FORMAT. This represents the "at" state.
                          The "before" state is calculated by subtracting 1ms.

    Returns:
        A string: "LABEL_ADDED", "LABEL_REMOVED", or "NO_CHANGE".
    """
    # ACCOUNT_DENYLISTED applies only to accounts, it should produce no events for account holders
    if label_type == "ACCOUNT_DENYLISTED":
        return "NO_CHANGE"

    event_effective_at_dt = pd.to_datetime(effective_at_str) # Convert to datetime for timedelta

    ah_accounts_map = {source_ah_token: source_ah_accounts}

    # 1. Determine "Before" state
    effective_at_before_dt = event_effective_at_dt - timedelta(milliseconds=1)
    effective_at_before_query_str = effective_at_before_dt.strftime(SNOWFLAKE_DT_FORMAT)
    
    df_before = get_account_holder_labels(
        account_holder_accounts=ah_accounts_map,
        effective_at=effective_at_before_query_str
    )
    active_labels_before = set(df_before['LABEL'].unique()) if not df_before.empty else set()

    # 2. Determine "At" state (using the original effective_at_str)
    df_at = get_account_holder_labels(
        account_holder_accounts=ah_accounts_map,
        effective_at=effective_at_str 
    )
    active_labels_at = set(df_at['LABEL'].unique()) if not df_at.empty else set()
            
    label_active_before = label_type in active_labels_before
    label_active_at = label_type in active_labels_at

    if label_active_at and not label_active_before:
        return "LABEL_ADDED"
    elif not label_active_at and label_active_before: 
        return "LABEL_REMOVED"
    else:
        return "NO_CHANGE"
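
The before/at comparison reduces to a small decision over two label sets. The sketch below isolates that logic from the Snowflake queries. `before_timestamp` and `classify_impact` are hypothetical names. Because labels are aggregated across all of the account holder's accounts, removing a label from one account yields `NO_CHANGE` while another account still carries it.

```python
from datetime import datetime, timedelta

SNOWFLAKE_DT_FORMAT = '%Y-%m-%d %H:%M:%S.%f'


def before_timestamp(effective_at_str: str) -> str:
    """Return the timestamp 1 ms earlier, in the same string format."""
    dt = datetime.strptime(effective_at_str, SNOWFLAKE_DT_FORMAT)
    return (dt - timedelta(milliseconds=1)).strftime(SNOWFLAKE_DT_FORMAT)


def classify_impact(label_type, active_before, active_at) -> str:
    """Compare label membership before and at the event time."""
    was_active = label_type in active_before
    is_active = label_type in active_at
    if is_active and not was_active:
        return "LABEL_ADDED"
    if was_active and not is_active:
        return "LABEL_REMOVED"
    return "NO_CHANGE"


print(before_timestamp("2025-06-26 11:27:43.201000"))  # 2025-06-26 11:27:43.200000
print(classify_impact("BLACKLISTED", set(), {"BLACKLISTED"}))            # LABEL_ADDED
print(classify_impact("BLACKLISTED", {"BLACKLISTED"}, {"BLACKLISTED"}))  # NO_CHANGE
```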

### Account Holder Functions
Functions to analyze account holders and their connections.

In [0]:
def get_ah_for_account(
    account_token: str,
    effective_at: str
) -> Optional[str]:
    """Gets the account holder for a CASH_CUSTOMER account.
    
    Args:
        account_token: The CASH_CUSTOMER token to find the account holder for
        effective_at: Timestamp for point-in-time query
        
    Returns:
        The ACCOUNT_HOLDER_CASH_CUSTOMER token or None if not found
    """
    query = f"""
    SELECT FROM_TOKEN as ACCOUNT_HOLDER_TOKEN
    FROM duplograph.public.edges
    WHERE FROM_TYPE = 'ACCOUNT_HOLDER_CASH_CUSTOMER'
    AND TO_TYPE = 'CASH_CUSTOMER'
    AND TO_TOKEN = '{account_token}'
    AND EFFECTIVE_AT <= '{effective_at}'
    """
    
    df = snowflake_query(query)
    return df['ACCOUNT_HOLDER_TOKEN'].iloc[0] if not df.empty else None


In [0]:
def get_all_active_accounts(
    account_holders: Set[str],
    effective_at: str
) -> Dict[str, Set[str]]:
    """Gets all active (non-merged) accounts for a set of account holders.
    
    Args:
        account_holders: Set of account holder tokens to get accounts for
        effective_at: Timestamp for point-in-time query
        
    Returns:
        Dictionary mapping account holder tokens to their set of active (non-merged) account tokens.
        Account holders with no active accounts will have empty sets or be missing from the dictionary.
    """
    if not account_holders:
        return {}
        
    query = f"""
    WITH account_edges AS (
        -- Get all account mappings for the account holders
        SELECT 
            FROM_TOKEN AS account_holder_token,
            TO_TOKEN AS account_token,
            EFFECTIVE_AT AS mapping_effective_at
        FROM duplograph.public.edges
        WHERE FROM_TYPE = 'ACCOUNT_HOLDER_CASH_CUSTOMER'
        AND TO_TYPE = 'CASH_CUSTOMER'
        AND FROM_TOKEN IN {to_sql_list(account_holders)}
        AND EFFECTIVE_AT <= '{effective_at}'
    )
    SELECT 
        ae.account_holder_token,
        ae.account_token
    FROM account_edges ae
    WHERE NOT EXISTS (
        -- Exclude merged accounts
        SELECT 1
        FROM duplograph.public.edges m
        WHERE m.FROM_TOKEN = ae.account_token
        AND m.FROM_TYPE = 'CASH_CUSTOMER'
        AND m.TO_TYPE = 'CASH_CUSTOMER'
        AND m.EFFECTIVE_AT <= '{effective_at}'
    )
    """
    
    df = snowflake_query(query)
    
    # Convert to dictionary mapping account holders to their account sets
    result = {}
    if not df.empty:
        for account_holder in account_holders:
            accounts = df[df['ACCOUNT_HOLDER_TOKEN'] == account_holder]['ACCOUNT_TOKEN'].tolist()
            result[account_holder] = set(accounts)
    
    return result
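
The merged-account exclusion can be illustrated on a handful of in-memory edges. This is a sketch under assumptions: edges are plain tuples with integer timestamps, and `active_accounts` is a hypothetical name. Unlike the function above, it always returns a key for every requested account holder.

```python
def active_accounts(edges, account_holders, effective_at):
    """Map each account holder to its non-merged accounts at `effective_at`.

    edges: iterable of (from_type, from_token, to_type, to_token, effective_at).
    """
    visible = [e for e in edges if e[4] <= effective_at]
    # An account counts as merged once it has an outgoing
    # CASH_CUSTOMER -> CASH_CUSTOMER edge, mirroring the NOT EXISTS clause.
    merged = {
        from_token
        for from_type, from_token, to_type, _, _ in visible
        if from_type == "CASH_CUSTOMER" and to_type == "CASH_CUSTOMER"
    }
    result = {holder: set() for holder in account_holders}
    for from_type, from_token, to_type, to_token, _ in visible:
        if (from_type == "ACCOUNT_HOLDER_CASH_CUSTOMER"
                and to_type == "CASH_CUSTOMER"
                and from_token in result
                and to_token not in merged):
            result[from_token].add(to_token)
    return result


# Made-up graph: C_2 is merged into C_1 at t=30.
edges = [
    ("ACCOUNT_HOLDER_CASH_CUSTOMER", "AH_1", "CASH_CUSTOMER", "C_1", 10),
    ("ACCOUNT_HOLDER_CASH_CUSTOMER", "AH_1", "CASH_CUSTOMER", "C_2", 20),
    ("CASH_CUSTOMER", "C_2", "CASH_CUSTOMER", "C_1", 30),
]
print(active_accounts(edges, {"AH_1"}, 25))  # both accounts are still active
print(active_accounts(edges, {"AH_1"}, 30))  # only C_1 remains
```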

In [0]:
def get_connected_account_holders(
    source_account_holder: str,
    source_active_accounts: Set[str],
    effective_at: str,
    asset_types: Optional[Set[str]] = None
) -> Tuple[pd.DataFrame, pd.DataFrame, Set[str]]:
    """Gets account holders connected to source accounts via shared assets.
    Note: Returns all connected accounts, including merged ones.
    Merged account filtering should be done by the caller.
    
    Args:
        source_account_holder: The source account holder token
        source_active_accounts: Set of active (non-merged) account tokens belonging to the source account holder
        effective_at: Timestamp for point-in-time query
        asset_types: Optional set of asset types to filter by
        
    Returns:
        Tuple of (
            asset_connections: DataFrame with asset connection details including:
                - ACCOUNT_TOKEN: The connected account
                - ASSET_TOKEN: The shared asset
                - ASSET_TYPE: Type of the shared asset
                - ASSET_CONNECTION_TIME: When the account connected to the asset
                - SOURCE_CONNECTION_TIME: When the source account connected to the asset
                - ACCOUNT_HOLDER_TOKEN: The account holder owning the connected account
                - EARLIEST_CONNECTION_TIME: Per account holder and asset type, the earliest time at which
                  both the source and the target account were connected to a shared asset
            holder_mapping: DataFrame mapping account holders to their connected accounts:
                - ACCOUNT_HOLDER_TOKEN: The account holder
                - ACCOUNT_TOKEN: Their account that shares an asset
                - MAPPING_EFFECTIVE_AT: When the account was mapped to the holder
            connected_holders: Set of account holder tokens that share assets with the source
        )
    """
    if not source_active_accounts:
        return pd.DataFrame(), pd.DataFrame(), set()
        
    asset_type_clause = "1=1"
    if asset_types:
        asset_type_clause = f"TO_TYPE IN {to_sql_list(asset_types)}"
        
    query = f"""
    WITH source_assets AS (
        -- Get assets connected to source accounts
        SELECT DISTINCT 
            TO_TOKEN,
            TO_TYPE,
            MIN(EFFECTIVE_AT) as SOURCE_CONNECTION_TIME  -- When source first connected
        FROM duplograph.public.edges
        WHERE FROM_TYPE = 'CASH_CUSTOMER'
        AND FROM_TOKEN IN {to_sql_list(source_active_accounts)}
        AND TO_TYPE != 'CASH_CUSTOMER'
        AND {asset_type_clause}
        AND EFFECTIVE_AT <= '{effective_at}'
        GROUP BY TO_TOKEN, TO_TYPE
    ),
    connected_accounts AS (
        -- Get accounts connected to those assets, excluding source accounts
        SELECT DISTINCT
            e.FROM_TOKEN AS ACCOUNT_TOKEN,
            e.TO_TOKEN AS ASSET_TOKEN,
            e.TO_TYPE AS ASSET_TYPE,
            MIN(e.EFFECTIVE_AT) AS ASSET_CONNECTION_TIME,  -- When target first connected
            MIN(sa.SOURCE_CONNECTION_TIME) AS SOURCE_CONNECTION_TIME
        FROM duplograph.public.edges e
        JOIN source_assets sa ON e.TO_TOKEN = sa.TO_TOKEN 
            AND e.TO_TYPE = sa.TO_TYPE
        WHERE e.FROM_TYPE = 'CASH_CUSTOMER'
        AND e.FROM_TOKEN NOT IN {to_sql_list(source_active_accounts)}
        AND e.EFFECTIVE_AT <= '{effective_at}'
        GROUP BY e.FROM_TOKEN, e.TO_TOKEN, e.TO_TYPE
    ),
    account_holders AS (
        -- Get account holder relationships for connected accounts
        SELECT 
            e.FROM_TOKEN AS ACCOUNT_HOLDER_TOKEN,
            e.TO_TOKEN AS ACCOUNT_TOKEN,
            e.EFFECTIVE_AT AS MAPPING_EFFECTIVE_AT
        FROM duplograph.public.edges e
        WHERE e.FROM_TYPE = 'ACCOUNT_HOLDER_CASH_CUSTOMER'
        AND e.TO_TYPE = 'CASH_CUSTOMER'
        AND e.TO_TOKEN IN (SELECT ACCOUNT_TOKEN FROM connected_accounts)
        AND e.FROM_TOKEN != '{source_account_holder}'
        AND e.EFFECTIVE_AT <= '{effective_at}'
    ),
    connections AS (
        -- Join with account holders and get earliest connection times
        SELECT 
            ca.*,
            ah.ACCOUNT_HOLDER_TOKEN,
            ah.MAPPING_EFFECTIVE_AT,
            -- For each account holder, find the earliest effective connection time
            -- (the later of source connection and target connection)
            MIN(GREATEST(ca.ASSET_CONNECTION_TIME, ca.SOURCE_CONNECTION_TIME)) 
                OVER (PARTITION BY ah.ACCOUNT_HOLDER_TOKEN, ca.ASSET_TYPE) AS EARLIEST_CONNECTION_TIME
        FROM connected_accounts ca
        JOIN account_holders ah ON ca.ACCOUNT_TOKEN = ah.ACCOUNT_TOKEN
    )
    SELECT * FROM connections
    ORDER BY ACCOUNT_HOLDER_TOKEN, ASSET_TOKEN
    """
    
    df = snowflake_query(query)
    
    if df.empty:
        return df, pd.DataFrame(), set()
        
    # Extract unique account holder tokens
    connected_holders = set(df['ACCOUNT_HOLDER_TOKEN'].unique())
    
    # Create account holder mapping DataFrame
    holder_mapping = df[['ACCOUNT_HOLDER_TOKEN', 'ACCOUNT_TOKEN', 'MAPPING_EFFECTIVE_AT']].drop_duplicates()
    
    # Create asset connections DataFrame
    asset_connections = df[[
        'ACCOUNT_TOKEN', 'ASSET_TOKEN', 'ASSET_TYPE', 'ASSET_CONNECTION_TIME',
        'SOURCE_CONNECTION_TIME', 'ACCOUNT_HOLDER_TOKEN', 'EARLIEST_CONNECTION_TIME'
    ]].drop_duplicates()
    
    return asset_connections, holder_mapping, connected_holders
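
The `EARLIEST_CONNECTION_TIME` window expression is the least obvious part of the query. For each shared asset, a connection exists only once both sides are attached, which is the later of the two times. Per account holder and asset type, the query then takes the earliest of those. A small sketch with made-up rows, integer timestamps, and hypothetical asset type names:

```python
def earliest_connection_times(rows):
    """Compute MIN(GREATEST(target_time, source_time)) per (holder, asset type)."""
    earliest = {}
    for row in rows:
        key = (row["ACCOUNT_HOLDER_TOKEN"], row["ASSET_TYPE"])
        # The connection exists only once both sides are attached to the asset.
        both_connected_at = max(row["ASSET_CONNECTION_TIME"],
                                row["SOURCE_CONNECTION_TIME"])
        earliest[key] = min(earliest.get(key, both_connected_at), both_connected_at)
    return earliest


# "DEVICE" and "CARD" are placeholder asset types for illustration.
rows = [
    {"ACCOUNT_HOLDER_TOKEN": "AH_t", "ASSET_TYPE": "DEVICE",
     "ASSET_CONNECTION_TIME": 10, "SOURCE_CONNECTION_TIME": 50},
    {"ACCOUNT_HOLDER_TOKEN": "AH_t", "ASSET_TYPE": "DEVICE",
     "ASSET_CONNECTION_TIME": 30, "SOURCE_CONNECTION_TIME": 20},
    {"ACCOUNT_HOLDER_TOKEN": "AH_t", "ASSET_TYPE": "CARD",
     "ASSET_CONNECTION_TIME": 5, "SOURCE_CONNECTION_TIME": 7},
]
earliest = earliest_connection_times(rows)
print(earliest)  # {('AH_t', 'DEVICE'): 30, ('AH_t', 'CARD'): 7}
```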

## Main Validation Logic

The core validation function that orchestrates the validation process for each input label event.

In [0]:
def validate_no_events_condition(
    label_event_id: str,
    actual_events: Set[str],
    status_message: str
) -> ValidationResult:
    """Validates cases where no events should be present for a given input label event.
    
    Args:
        label_event_id: The identifier for the input label event being validated.
        actual_events: Set of actual event IDs found for this input.
        status_message: Status message explaining why no events are expected.
        
    Returns:
        ValidationResult: Result object with appropriate status and any unexpected events.
    """
    result = ValidationResult(label_event_id)
    
    if actual_events:
        result.add_error("validation_status", status_message)
        for event_id_val in actual_events:
            result.add_error("unexpected_event", event_id_val)
    else:
        result.success = True
        result.validation_status = status_message
    
    return result

In [0]:
def get_actual_label_change_events(
    account_token: str,
    label_type: str, 
    present: bool,
    effective_at_ms: int,
    ccgce_table: str
) -> pd.DataFrame:
    """
    Fetches actual LabelChange CashConnectedGraphChangeEvents from the specified table,
    filtered by source account and label type that changed on the source and the
    effective_at time of the label change.

    Args:
        account_token: The source account (CASH_CUSTOMER) token from the input label event.
        label_type: The label type expected in event.label_change.changed_source_user_label.
        present: Boolean indicating whether the label was present or not
        effective_at_ms: The effective_at timestamp of the label change in millis
        ccgce_table: The Databricks table containing CashConnectedGraphChangeEvents.

    Returns:
        Pandas DataFrame of matching LabelChange events.
    """
    query = f"""
    SELECT 
        cash_connected_graph_change_event.event_id as event_id,
        cash_connected_graph_change_event.source_user_token as source_user_token,
        cash_connected_graph_change_event.target_user_token as target_user_token,
        cash_connected_graph_change_event.effective_at_millis as effective_at_millis,
        cash_connected_graph_change_event.published_at_millis as published_at_millis,
        cash_connected_graph_change_event.event_type as event_type,
        cash_connected_graph_change_event.label_change.changed_source_user_label as changed_source_user_label,
        cash_connected_graph_change_event.label_change.connection_node_types as connection_node_types
    FROM {ccgce_table}
    WHERE cash_connected_graph_change_event.label_change.source_label_event.node.token = '{account_token}'
    AND cash_connected_graph_change_event.label_change.source_label_event.label = '{label_type}'
    AND cash_connected_graph_change_event.label_change.source_label_event.present = '{present}'
    AND cash_connected_graph_change_event.label_change.source_label_event.effective_at_msec = '{effective_at_ms}'
    """
    actual_events_df = spark_query(query)
    return actual_events_df


In [0]:
def validate_connection_types(
    actual_events_df: pd.DataFrame, # Contains event_id and actual 'connection_node_types' (list)
    expected_connection_types_map: Dict[str, Set[str]], # Maps event_id to a SET of expected node type strings
    validation_result: ValidationResult
) -> None:
    """
    Validates the connection_node_types in LabelChange events.

    Args:
        actual_events_df: DataFrame of actual LabelChange events to validate.
                          Expected columns: ['event_id', 'connection_node_types'].
        expected_connection_types_map: A dictionary mapping event_id to a SET
                                       of expected connection node type strings.
        validation_result: ValidationResult object to store errors.
    """
    if actual_events_df.empty:
        return

    required_columns = ['event_id', 'connection_node_types']
    if not all(col in actual_events_df.columns for col in required_columns):
        validation_result.add_error(
            "connection_type", # Using the error type we defined in ValidationResult
            f"actual_events_df missing required columns for connection_types validation. Expected: {required_columns}"
        )
        return

    for _, event_row in actual_events_df.iterrows():
        event_id = event_row['event_id']
        actual_types_set = set(event_row['connection_node_types']) # This is a list from the event payload

        expected_types_set = expected_connection_types_map.get(event_id)

        if expected_types_set is None:
            # This case should ideally not happen if actual_events_df is already filtered
            # to include only events for which we have expectations.
            # However, if it does, it means we have an actual event we didn't form expectations for.
            # This might be an "unexpected event" already, or a logic error in preparing expected_connection_types_map.
            # For now, we can log it if it occurs.
            validation_result.add_error(
                "connection_type",
                f"Event {event_id} found in actual_events_df but no expected connection types were prepared for it."
            )
            continue

        if actual_types_set != expected_types_set:
            validation_result.add_error(
                "connection_type",
                f"Connection type mismatch for event {event_id}:\n"
                f"  Expected: {sorted(expected_types_set)}\n"
                f"  Actual:   {sorted(actual_types_set)}"
            )
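
Both the event ID check described in the overview and the connection type check above come down to set comparisons. A minimal sketch with made-up IDs and placeholder node types (`diff_event_ids` and `connection_types_match` are hypothetical names):

```python
def diff_event_ids(expected_ids, actual_ids):
    """Return (missing, unexpected) event IDs as two sets."""
    missing = expected_ids - actual_ids     # expected but never generated
    unexpected = actual_ids - expected_ids  # generated but not expected
    return missing, unexpected


def connection_types_match(actual_types, expected_types):
    """Compare node types while ignoring order and duplicates in the payload list."""
    return set(actual_types) == set(expected_types)


expected = {"L/AH_t1/AH_s/LABEL_A/1/+", "L/AH_t2/AH_s/LABEL_A/1/+"}
actual = {"L/AH_t1/AH_s/LABEL_A/1/+", "L/AH_t3/AH_s/LABEL_A/1/+"}
missing, unexpected = diff_event_ids(expected, actual)
print(sorted(missing))     # ['L/AH_t2/AH_s/LABEL_A/1/+']
print(sorted(unexpected))  # ['L/AH_t3/AH_s/LABEL_A/1/+']

# "DEVICE" and "CARD" are placeholder node types.
print(connection_types_match(["DEVICE", "CARD", "CARD"], {"CARD", "DEVICE"}))  # True
```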

In [0]:
def validate_bau_label_event(
    node_token: str,
    label_type: str,
    label_present: bool,
    effective_at_ms: int, 
    ccgce_table: str
) -> ValidationResult:
    """
    Orchestrates the validation process for a single input BAU label event.

    It determines the impact of the label event on the source account holder's
    active labels and compares generated CashConnectedGraphChangeEvents (if any)
    against expected events computed from Duplograph's state.

    Args:
        node_token: The CASH_CUSTOMER token from the input label event.
        label_type: The label type string from the input label event.
        label_present: Boolean indicating if the label is present in the input event.
        effective_at_ms: The effective_at timestamp of the input label event, as epoch milliseconds.
        ccgce_table: The Databricks table containing actual Edgy-generated
                     CashConnectedGraphChangeEvents.

    Returns:
        A ValidationResult object summarizing the outcome for this input label event.
    """
    label_event_id = f"{node_token}/{label_type}/{label_present}/{effective_at_ms}"

    validation_result = ValidationResult(label_event_id)

    effective_at_dt = pd.to_datetime(effective_at_ms, unit='ms')
    effective_at_str = effective_at_dt.strftime(SNOWFLAKE_DT_FORMAT)

    # Fetch LabelChange events that have this LabelEvent as their source event
    actual_events_df = get_actual_label_change_events(
        account_token=node_token,
        label_type=label_type,
        present=label_present,
        effective_at_ms=effective_at_ms,
        ccgce_table=ccgce_table
    )
    actual_event_id_set = set(actual_events_df['event_id'])

    source_ah_token = get_ah_for_account(node_token, effective_at_str)

    if not source_ah_token:
        return validate_no_events_condition(
            label_event_id, actual_event_id_set,
            f"Could not find AH for CASH_CUSTOMER {node_token} at {effective_at_str}."
        )

    source_ah_accounts_map = get_all_active_accounts({source_ah_token}, effective_at_str)
    source_ah_active_accounts = source_ah_accounts_map.get(source_ah_token, set())

    if not source_ah_active_accounts:
        return validate_no_events_condition(
            label_event_id, actual_event_id_set,
            f"Source AH {source_ah_token} has no active accounts. No LabelChange events expected."
        )

    label_impact = get_label_event_impact(
        source_ah_token, source_ah_active_accounts, label_type, effective_at_str
    )

    if label_impact == "NO_CHANGE":
        return validate_no_events_condition(
            label_event_id, actual_event_id_set,
            f"Input label event resulted in NO_CHANGE to AH {source_ah_token}'s active labels for {label_type}. No events expected."
        )
    
    asset_connections_df, _, connected_target_ahs_all = get_connected_account_holders(
        source_ah_token, source_ah_active_accounts, effective_at_str
    )

    if not connected_target_ahs_all:
        return validate_no_events_condition(
            label_event_id, actual_event_id_set,
            f"Source AH {source_ah_token} has no connected target AHs. No events expected."
        )
        
    target_ahs_accounts_map = get_all_active_accounts(connected_target_ahs_all, effective_at_str)
    active_target_ahs = { ah for ah, accs in target_ahs_accounts_map.items() if accs }

    if not active_target_ahs:
        return validate_no_events_condition(
            label_event_id, actual_event_id_set,
            f"Source AH {source_ah_token} has no *active* connected target AHs. No events expected."
        )

    expected_event_ids = set()
    expected_connection_types_map = {} 

    for target_ah_token in active_target_ahs:
        shared_assets_for_this_target_df = asset_connections_df[
            asset_connections_df['ACCOUNT_HOLDER_TOKEN'] == target_ah_token
        ]
        if not shared_assets_for_this_target_df.empty:
            shared_asset_type_names = set(shared_assets_for_this_target_df['ASSET_TYPE'].unique())
            if shared_asset_type_names:
                event_id = label_change_event_id(
                    target_token=target_ah_token, source_token=source_ah_token,
                    label_type_name=label_type, effective_at_millis=effective_at_ms,
                    event_type_str=label_impact 
                )
                expected_event_ids.add(event_id)
                expected_connection_types_map[event_id] = shared_asset_type_names

    if not expected_event_ids: # This means no shared assets with *active* targets generated any expected event_ids
        return validate_no_events_condition(
            label_event_id, actual_event_id_set, # Check against all related actuals
            f"No expected events generated for AH {source_ah_token} (e.g. no shared assets with active targets). No events expected."
        )

    missing = expected_event_ids - actual_event_id_set 
    for eid in missing:
        validation_result.add_error("missing_event", eid)
    
    # Unexpected events are those in the fetched 'actual_event_id_set'
    # that are not in our 'expected_event_ids'.
    unexpected = actual_event_id_set - expected_event_ids
    for eid in unexpected:
        validation_result.add_error("unexpected_event", eid)

    events_to_validate_ids = expected_event_ids.intersection(actual_event_id_set)
    
    if events_to_validate_ids:
        events_to_validate_df = actual_events_df[actual_events_df['event_id'].isin(events_to_validate_ids)].copy()

        if not events_to_validate_df.empty:
            mismatched_label_payload_df = events_to_validate_df[
                events_to_validate_df['changed_source_user_label'] != label_type
            ]
            for _, event_row in mismatched_label_payload_df.iterrows():
                validation_result.add_error(
                    "payload_error", 
                    f"Event {event_row['event_id']} has incorrect changed_source_user_label: "
                    f"Expected '{label_type}', Actual '{event_row['changed_source_user_label']}'"
                )
            
            validate_timestamps(
                events_to_validate_df, effective_at_ms, validation_result
            )

            validate_connection_types(
                events_to_validate_df, expected_connection_types_map, validation_result
            )

    if validation_result.success and not validation_result.validation_status:
        validation_result.validation_status = "All validations passed"

    return validation_result

In [0]:
def validate_label_events(
    label_events_set: Set[Tuple[str, str, bool, int]],
    ccgce_table: str,
    output_dir: Optional[str] = None,
    csv_path: Optional[str] = None
) -> List[ValidationResult]:
    """Validates Edgy BAU label event processing for multiple input label events.
    
    Args:
        label_events_set: Set of Tuples, where each Tuple contains
                               the details of an input label event in the order:
                               (node_token, label_type, present, effective_at_millis)
        ccgce_table: The Databricks table containing CashConnectedGraphChangeEvents.
        output_dir: Optional directory for output files. Defaults to current directory.
        csv_path: Optional explicit path for the output CSV file.
        
    Returns:
        List of ValidationResult objects, one per input label event.
    """
    input_label_events_list = list(label_events_set)
    num_events_to_validate = len(input_label_events_list)
    print(f"Validating {num_events_to_validate} unique input label events.")
    for label_event in label_events_set:
        print(f"- {label_event[0]} -> {label_event[1]}={label_event[2]} @ {label_event[3]}")

    if output_dir:
        os.makedirs(output_dir, exist_ok=True)
    else:
        output_dir = '.'
            
    if csv_path is None:
        timestamp_str = datetime.now().strftime('%Y%m%d_%H%M%S')
        csv_path = os.path.join(output_dir, f'bau_label_validation_{timestamp_str}.csv') 
    
    all_results = []
    timings = {} 
    errors = []
    
    for i, event_tuple in enumerate(input_label_events_list):
        node_token, label_type, present, effective_at_millis = event_tuple
        label_event_id = f"{node_token}/{label_type}/{present}/{effective_at_millis}"
        # For printing, convert millis to datetime string
        effective_at_str = pd.to_datetime(effective_at_millis, unit='ms')
        print(f"\nProcessing input event {i+1}/{num_events_to_validate}: {label_event_id}, effective_at={effective_at_str}")
        
        start_time = time.time()
        
        try:
            result = validate_bau_label_event(
                node_token=node_token,
                label_type=label_type,
                label_present=present,
                effective_at_ms=effective_at_millis, 
                ccgce_table=ccgce_table
            )
            all_results.append(result)
            
            result.write_to_csv(csv_path) 
            
            end_time = time.time()
            duration = end_time - start_time
            timings[label_event_id] = duration
            
            print(f"Time to validate event {label_event_id}: {duration:.2f} seconds")
            print(result.get_summary())
        except Exception as e:
            error_msg = f"ERROR while validating label event {label_event_id}: {type(e).__name__}: {e}"
            print(error_msg)
            errors.append((label_event_id, e))
            # Optionally: Keep running or re-raise depending on workflow; here we keep running next events

    success_count = sum(1 for r in all_results if getattr(r, 'success', False))
    print(f"\nValidation complete: {success_count}/{num_events_to_validate} input label events passed.")
    
    if errors:
        print(f"\n{len(errors)} label events failed due to code errors:")
        for label_id, err in errors:
            print(f"  {label_id}: {type(err).__name__}: {err}")

    if timings:
        avg_time = sum(timings.values()) / len(timings)
        max_time_val = max(timings.values())
        min_time_val = min(timings.values())
        print("\nTiming statistics:")
        print(f"- Average time per input label event: {avg_time:.2f} seconds")
        print(f"- Fastest validation: {min_time_val:.2f} seconds")
        print(f"- Slowest validation: {max_time_val:.2f} seconds")
        print(f"- Total validation time for this batch: {sum(timings.values()):.2f} seconds")
    
    print(f"\nResults written to: {csv_path}")
    return all_results

In [0]:
def validate_label_events_batch(
    label_events_set: Set[Tuple[str, str, bool, int]],
    ccgce_table: str,
    output_dir: str,
    batch_size: int = 5,
    max_workers: int = 3
):
    """Validates input label events in parallel batches.
    
    Args:
        label_events_set: Set of Tuples representing input label events.
                                Each tuple: (node_token, label_type, present, effective_at_millis)
        ccgce_table: The Databricks table for CashConnectedGraphChangeEvents.
        output_dir: Directory for output CSV file.
        batch_size: Number of label events to process in each parallel batch.
        max_workers: Maximum number of parallel threads.
    """
    os.makedirs(output_dir, exist_ok=True)
    
    timestamp_str = datetime.now().strftime('%Y%m%d_%H%M%S')
    csv_path = os.path.join(output_dir, f'bau_label_validation_{timestamp_str}.csv') 
    
    input_label_events_list = list(label_events_set)
    num_total_events = len(input_label_events_list)
    num_batches = math.ceil(num_total_events / batch_size)
    
    batches = [
        set(input_label_events_list[i * batch_size:(i + 1) * batch_size])
        for i in range(num_batches)
    ]
    
    print(f"Processing {num_total_events} unique input label events in {num_batches} batches "
          f"of up to size {batch_size}, using {max_workers} workers.")
    print(f"Results will be written to: {csv_path}")
    
    start_time = time.time() 
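    # as_completed yields futures in completion order, not submission order,
    # so map each future to its 1-based batch number to report the right batch.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_batch = {
            executor.submit(
                validate_label_events,
                batch,
                ccgce_table,
                output_dir,
                csv_path
            ): batch_num
            for batch_num, batch in enumerate(batches, start=1)
        }

        for future in concurrent.futures.as_completed(future_to_batch):
            batch_num = future_to_batch[future]
            try:
                future.result()
                print(f"\nCompleted batch {batch_num}/{num_batches}")
            except Exception as e:
                print(f"\nBatch {batch_num} failed with error: {e}")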
    
    end_time = time.time()
    print(f"\nAll batches processed. Total time: {end_time - start_time:.2f} seconds.")
    print(f"Consolidated results are in: {csv_path}")

    return csv_path

## Validation Execution
Set up parameters, fetch sample input label events, and run the validation.
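
The batching applied when the validation is run can be illustrated in isolation. This is a minimal sketch rather than the notebook's own code: `chunk_events` is a hypothetical helper name, and the tuples are made-up placeholders in the `(node_token, label_type, present, effective_at_millis)` order used throughout.

```python
import math
from typing import List, Tuple

LabelEvent = Tuple[str, str, bool, int]

def chunk_events(events: List[LabelEvent], batch_size: int) -> List[List[LabelEvent]]:
    """Split events into consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    num_batches = math.ceil(len(events) / batch_size)
    return [events[i * batch_size:(i + 1) * batch_size] for i in range(num_batches)]

# Placeholder events (hypothetical tokens, not real data)
events = [(f"C_example{i}", "BLACKLISTED", True, 1750930000000 + i) for i in range(23)]
batches = chunk_events(events, batch_size=10)
print([len(b) for b in batches])  # [10, 10, 3]
```

The last batch may be smaller than `batch_size`, which is why the progress message reads "batches of up to size N".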

In [0]:
#ccgce_table = "cash_banking_ml_eng.cash_connected_graph_change_event.edgy_test_20250601"
#ccgce_table = "cash_banking_ml_eng.cash_connected_graph_change_event.edgy_test_20250610"
#ccgce_table = "cash_banking_ml_eng.cash_connected_graph_change_event.edgy_test_20250618"
ccgce_table = "cash_banking_ml_eng.cash_connected_graph_change_event.edgy_test_20250626"

In [0]:
# Obtain the date range of BAU events in ccgce_table that can be meaningfully validated 
query = f"""
SELECT MIN(cash_connected_graph_change_event.effective_at_millis) AS min_effective_at,
    MIN(cash_connected_graph_change_event.published_at_millis) AS min_published_at,
    MAX(cash_connected_graph_change_event.effective_at_millis) AS max_effective_at,
    MAX(cash_connected_graph_change_event.published_at_millis) AS max_published_at
    from {ccgce_table}
WHERE cash_connected_graph_change_event.event_source_type = 'BAU'
AND cash_connected_graph_change_event.label_change is not null
"""
date_range_df = spark_query(query)
# Use the max of the earliest effective_at and published_at timestamps as the start time
start_time = pd.to_datetime(max(date_range_df[["min_effective_at", "min_published_at"]].iloc[0]), unit="ms")
# Use the min of the latest effective_at and published_at timestamps as the finish time
finish_time = pd.to_datetime(min(date_range_df[["max_effective_at", "max_published_at"]].iloc[0]), unit="ms")
print(f"Edgy output time range to validate: {start_time} - {finish_time}")

Edgy output time range to validate: 2025-06-26 00:00:02.829000 - 2025-06-26 23:59:57.430000


In [0]:
# Fetch a sample of CASH_CUSTOMER labels from duplograph.public.label_events that were 
# created and became effective within the time range of the edgy output 
# (avoiding the boundaries in case of lag issues) 
query = f"""
SELECT 
    node_token, 
    label, 
    present, 
    effective_at
FROM duplograph.public.label_events 
WHERE node_type = 'CASH_CUSTOMER' 
AND effective_at BETWEEN '{start_time + pd.to_timedelta('8h')}' AND '{finish_time - pd.to_timedelta('8h')}'
AND created_at BETWEEN '{start_time + pd.to_timedelta('8h')}' AND '{finish_time - pd.to_timedelta('8h')}'
LIMIT 1000
"""
bau_labels_df = snowflake_query(query)
bau_labels_df

Unnamed: 0,NODE_TOKEN,LABEL,PRESENT,EFFECTIVE_AT
0,C_5xd810m63,HAS_ANY_CASH_DISPUTE,True,2025-06-26 09:22:03.000
1,C_sxepcqywh,BLACKLISTED,True,2025-06-26 11:31:40.928
2,C_fqd2kpmrq,BLACKLISTED,True,2025-06-26 08:03:53.813
3,C_5mc6dgmaj,BLACKLISTED,True,2025-06-26 09:47:25.337
4,C_st2xppm3m,BLACKLISTED,True,2025-06-26 11:26:28.780
...,...,...,...,...
995,C_g42w9qytp,BLACKLISTED,True,2025-06-26 10:23:34.680
996,C_5j20ahmp8,BLACKLISTED,True,2025-06-26 08:03:27.784
997,C_ss2aphyj1,BLACKLISTED,True,2025-06-26 13:26:26.645
998,C_5s2zx8mpe,HAS_CASH_CHECK_RETURNS,True,2025-06-26 09:21:27.994


In [0]:
# Convert DataFrame rows to a set of label event tuples for validation
label_events_to_validate = set(
    bau_labels_df
      .assign(
        EFFECTIVE_AT=lambda df: pd.to_datetime(df["EFFECTIVE_AT"]).astype("int64") // 1_000_000
      )
      [["NODE_TOKEN", "LABEL", "PRESENT", "EFFECTIVE_AT"]]
      .itertuples(index=False, name=None)
)

# Run validation in batches
output_file = validate_label_events_batch(
    label_events_set=label_events_to_validate,
    ccgce_table=ccgce_table,
    output_dir="bau_label_validation",
    batch_size=10, 
    max_workers=10
)

Processing 1000 unique input label events in 100 batches of up to size 10, using 10 workers.
Results will be written to: bau_label_validation/bau_label_validation_20250701_021110.csv
Validating 10 unique input label events.
- C_5nepxgmae -> HAS_CASH_CARD_DISPUTE=True @ 1750941726000
- C_59eb20mzg -> BLACKLISTED=True @ 1750941715950
- C_fscvx1mw7 -> HAS_CASH_CHECK_RETURNS=False @ 1750949045646
- C_8scjxhmee -> BLACKLISTED=True @ 1750933848926
- C_she8k8yr4 -> BLACKLISTED=True @ 1750944694979
- C_fa2dchyy2 -> HAS_CASH_CARD_CHARGED_WRONG_AMOUNT_DISPUTE=True @ 1750931647000
- C_54e1dqm68 -> BLACKLISTED=True @ 1750951248958
- C_5sebahmsz -> BLACKLISTED=True @ 1750939739817
- C_gydyd1ynh -> BLACKLISTED=True @ 1750950239355
- C_hx2tj1y42 -> HAS_CASH_CARD_DID_NOT_MAKE_TRANSACTION_DISPUTE=True @ 1750942568000

Processing input event 1/10: C_5nepxgmae/HAS_CASH_CARD_DISPUTE/True/1750941726000, effective_at=2025-06-26 12:42:06
Validating 10 unique input label events.
- C_hy2zrhydw -> BLACKLISTED=T


Time to validate event C_sm24pqm8v/BLACKLISTED/True/1750930878291: 22.94 seconds
ℹ️ Label Event C_sm24pqm8v/BLACKLISTED/True/1750930878291: All validations passed

Processing input event 5/10: C_n025k0ysh/BLACKLISTED/True/1750931382968, effective_at=2025-06-26 09:49:42.968000
Time to validate event C_hmctdpym8/HAS_CASH_CARD_AFT_REASON_CODE_DISPUTE/True/1750938458000: 22.37 seconds
ℹ️ Label Event C_hmctdpym8/HAS_CASH_CARD_AFT_REASON_CODE_DISPUTE/True/1750938458000: All validations passed

Processing input event 4/10: C_0tdnqhmah/HAS_ANY_CASH_DISPUTE/True/1750952985000, effective_at=2025-06-26 15:49:45
Time to validate event C_n0dv2amxq/BLACKLISTED/True/1750950897913: 28.62 seconds
❌ Label Event C_n0dv2amxq/BLACKLISTED/True/1750950897913: Validation failed

Unexpected events (5):
  - L/AH_1fyg4ee4s/AH_qxma2vd0n/BLACKLISTED/1750950897913/+
  - L/AH_3xy0py2nh/AH_qxma2vd0n/BLACKLISTED/1750950897913/+
  - L/AH_85yhjzcmf/AH_qxma2vd0n/BLACKLISTED/1750950897913/+
  - L/AH_d0y0dv2ys/AH_qxma2vd0n

In [0]:
# Path to store validation outputs under. Create this beforehand with dbutils.fs.mkdirs(f"dbfs:/Users/...")
s3_dir = "Users/ioanna"

# Copy output to S3 for persistence
output_path = f"{s3_dir}/{output_file}"
dbutils.fs.cp(f"file://{os.getcwd()}/{output_file}", f"dbfs:/{output_path}")

Out[22]: True

In [0]:
# # Single label event validation
# result = validate_bau_label_event(
#     node_token="C_hqc22hyxf",
#     label_type="BLACKLISTED",
#     label_present=True,
#     effective_at_ms=1749545984851, 
#     ccgce_table=ccgce_table
# )
# print(result.get_summary())

ℹ️ Label Event C_hqc22hyxf/BLACKLISTED/True/1749545984851: All validations passed


## Validation Result Analysis

In [0]:
# load file containing the validation results
validation_df = pd.read_csv(f"/dbfs/{output_path}")
validation_df

Unnamed: 0,label_event_id,validation_success,validation_status,missing_events,unexpected_events,timestamp_errors,connection_type_errors,payload_errors
0,C_hj2bapmtt/BLACKLISTED/True/1750930520679,True,Input label event resulted in NO_CHANGE to AH ...,,,,,
1,C_hs2ja1mdc/BLACKLISTED/True/1750946181287,True,Input label event resulted in NO_CHANGE to AH ...,,,,,
2,C_0nc5rhyf2/BLACKLISTED/True/1750938982725,True,All validations passed,,,,,
3,C_fy2t91yvf/HAS_ANY_CASH_DISPUTE/True/17509311...,True,All validations passed,,,,,
4,C_09edphmsn/HAS_CASH_CHECK_RETURNS/True/175092...,True,All validations passed,,,,,
...,...,...,...,...,...,...,...,...
988,C_n0c2wpy9n/BLACKLISTED/True/1750933607991,True,Input label event resulted in NO_CHANGE to AH ...,,,,,
989,C_n627p0mdw/BLACKLISTED/True/1750952709783,True,All validations passed,,,,,
990,C_5qdrxqmyb/BLACKLISTED/True/1750933669972,True,Input label event resulted in NO_CHANGE to AH ...,,,,,
991,C_50dexpys6/BLACKLISTED/True/1750948325895,True,All validations passed,,,,,


In [0]:
# Failed validations
failed_validations_df = validation_df[validation_df['validation_success'] == False][['label_event_id', 'validation_status', 'missing_events', 'unexpected_events', 'timestamp_errors', 'connection_type_errors']]
failed_validations_df

Unnamed: 0,label_event_id,validation_status,missing_events,unexpected_events,timestamp_errors,connection_type_errors
6,C_n02waqm42/HAS_CASH_CARD_DID_NOT_MAKE_TRANSAC...,FAIL,,L/AH_00y1phcn5/AH_24mqaw20n/HAS_CASH_CARD_DID_...,,
8,C_5nepxgmae/HAS_CASH_CARD_DISPUTE/True/1750941...,FAIL,,L/AH_19m1p5ets/AH_eamgxpen5/HAS_CASH_CARD_DISP...,,
123,C_n0dv2amxq/BLACKLISTED/True/1750950897913,FAIL,,L/AH_1fyg4ee4s/AH_qxma2vd0n/BLACKLISTED/175095...,,
152,C_syddxamj2/HAS_CASH_CHECK_RETURNS/True/175092...,FAIL,,L/AH_7xm8cnets/AH_2jmaxddys/HAS_CASH_CHECK_RET...,,
196,C_gedhwqmnp/HAS_CASH_CARD_DISPUTE/True/1750944...,FAIL,,L/AH_6hy0cpdtn/AH_pnmqwhdeg/HAS_CASH_CARD_DISP...,,
229,C_nhec2qm0p/HAS_ANY_CASH_DISPUTE/True/17509414...,FAIL,,L/AH_1dyqehcaf/AH_p0mq2cehn/HAS_ANY_CASH_DISPU...,,
239,C_nccnp1yte/HAS_CASH_CARD_DISPUTE/True/1750936...,FAIL,,L/AH_5gm0rydnh/AH_ety1pnccn/HAS_CASH_CARD_DISP...,,
244,C_fqexdgm00/HAS_CASH_CARD_DISPUTE/True/1750949...,FAIL,,L/AH_mjy89td0s/AH_00mgdxeqf/HAS_CASH_CARD_DISP...,,
275,C_5acvdqm99/HAS_CASH_CARD_DID_NOT_MAKE_TRANSAC...,FAIL,,L/AH_6cmqaf268/AH_99mqdvca5/HAS_CASH_CARD_DID_...,,
286,C_sqc2rayrk/BLACKLISTED/True/1750941139530,FAIL,,L/AH_vvygj8c68/AH_kryar2cqs/BLACKLISTED/175094...,,


In [0]:
if failed_validations_df.empty:
    print("All validations passed")
else:
    failed_validation_ids = failed_validations_df['label_event_id'].unique().tolist()
    num_failed = len(failed_validation_ids)
    all_validated = validation_df['label_event_id'].unique().tolist()
    num_validated = len(all_validated)
    print(f"{num_failed} out of {num_validated} label events had failed validations ({num_failed/num_validated*100:.2f}%):\n\n{failed_validation_ids}")

35 out of 993 label events had failed validations (3.52%):

['C_n02waqm42/HAS_CASH_CARD_DID_NOT_MAKE_TRANSACTION_DISPUTE/True/1750950093000', 'C_5nepxgmae/HAS_CASH_CARD_DISPUTE/True/1750941726000', 'C_n0dv2amxq/BLACKLISTED/True/1750950897913', 'C_syddxamj2/HAS_CASH_CHECK_RETURNS/True/1750929692776', 'C_gedhwqmnp/HAS_CASH_CARD_DISPUTE/True/1750944954000', 'C_nhec2qm0p/HAS_ANY_CASH_DISPUTE/True/1750941460000', 'C_nccnp1yte/HAS_CASH_CARD_DISPUTE/True/1750936537000', 'C_fqexdgm00/HAS_CASH_CARD_DISPUTE/True/1750949030000', 'C_5acvdqm99/HAS_CASH_CARD_DID_NOT_MAKE_TRANSACTION_DISPUTE/True/1750947053000', 'C_sqc2rayrk/BLACKLISTED/True/1750941139530', 'C_86c5dhy9e/HAS_CASH_CARD_DISPUTE/True/1750944771000', 'C_5024wgmxs/BLACKLISTED/True/1750946312624', 'C_fcd8dam9m/BLACKLISTED/True/1750942788505', 'C_5jdvw0mse/HAS_CASH_CARD_DISPUTE/True/1750951461000', 'C_5acvdqm99/HAS_ANY_CASH_DISPUTE/True/1750947053000', 'C_fmcck0ycx/BLACKLISTED/True/1750934193606', 'C_56cwjhmrd/BLACKLISTED/True/1750931219053'

### ✅ Missing events

Account holders with expected events that were missing in the actual output.
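
The missing/unexpected classification comes down to two set differences between expected and actual event IDs. A minimal sketch, assuming both sides are available as plain Python sets; `diff_event_ids` is a hypothetical helper and the IDs are placeholders, not real tokens:

```python
from typing import Dict, Set

def diff_event_ids(expected: Set[str], actual: Set[str]) -> Dict[str, Set[str]]:
    """Classify event IDs by comparing the expected and actual sets."""
    return {
        "missing": expected - actual,      # expected but never generated
        "unexpected": actual - expected,   # generated but not expected
        "matched": expected & actual,      # present in both; payloads can be checked further
    }

# Placeholder IDs in the L/target/source/label/effective_at_ms/marker shape
expected_ids = {
    "L/AH_target1/AH_source/BLACKLISTED/1750000000000/+",
    "L/AH_target2/AH_source/BLACKLISTED/1750000000000/+",
}
actual_ids = {
    "L/AH_target2/AH_source/BLACKLISTED/1750000000000/+",
    "L/AH_target3/AH_source/BLACKLISTED/1750000000000/+",
}
result = diff_event_ids(expected_ids, actual_ids)
print(sorted(result["missing"]))
print(sorted(result["unexpected"]))
```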

In [0]:
# Filter rows where missing_events is not empty
missing_events_df = validation_df[validation_df['missing_events'].notna()]

if missing_events_df.empty:
    print("No missing events found in the validation results")
else:
    print(f"Found {len(missing_events_df)} label ids with missing events:\n{missing_events_df['label_event_id'].to_list()}\n")
    for _, row in missing_events_df.iterrows():
        events = row['missing_events'].split('|')
        print(f"Label event: {row['label_event_id']}")
        print(f"Number of missing events: {len(events)}")
        print("Missing events:")
        for event in events:
            print(f"  - {event}")
        print()

No missing events found in the validation results


### ✅ Unexpected events

Account holders who had actual events that were not expected based on the validation.

A sample of the account holders that failed validation was manually checked. In every case, the discrepancy arose because the edge that produced the unexpected events exists in Duplograph's database (confirmed via RPC) but is missing from `duplograph.public.edges` in Snowflake.
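
When picking pairs for this kind of manual check, it can help to reduce the pipe-separated `unexpected_events` value to distinct (source, target) account holder pairs. The sketch below assumes the `L/target/source/label/effective_at_ms/marker` ID layout and the `|` separator used in the results CSV; `source_target_pairs` is a hypothetical helper name.

```python
from typing import List, Tuple

def source_target_pairs(unexpected_events: str) -> List[Tuple[str, str]]:
    """Return distinct (source, target) pairs, in first-seen order."""
    pairs: List[Tuple[str, str]] = []
    for event_id in unexpected_events.split("|"):
        parts = event_id.strip().split("/")
        if len(parts) != 6 or parts[0] != "L":
            continue  # skip malformed or truncated IDs
        _, target, source, _label, _effective_at, _marker = parts
        pair = (source, target)
        if pair not in pairs:
            pairs.append(pair)
    return pairs

cell = (
    "L/AH_1fyg4ee4s/AH_qxma2vd0n/BLACKLISTED/1750950897913/+"
    "|L/AH_3xy0py2nh/AH_qxma2vd0n/BLACKLISTED/1750950897913/+"
)
pairs = source_target_pairs(cell)
print(pairs)
# [('AH_qxma2vd0n', 'AH_1fyg4ee4s'), ('AH_qxma2vd0n', 'AH_3xy0py2nh')]
```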

In [0]:
# Filter rows where unexpected_events is not empty
unexpected_events_df = validation_df[validation_df['unexpected_events'].notna()]

if unexpected_events_df.empty:
    print("No unexpected events found in the validation results")
else:
    print(f"Found {len(unexpected_events_df)} label ids with unexpected events: \n{unexpected_events_df['label_event_id'].to_list()}\n")
    
    for _, row in unexpected_events_df.iterrows():
        events = row['unexpected_events'].split('|')
        
        print(f"Label event: {row['label_event_id']}")
        print("Unexpected events:")
        for event in events:
            print(f"  - {event}")
        print()

Found 30 label ids with unexpected events: 
['C_n02waqm42/HAS_CASH_CARD_DID_NOT_MAKE_TRANSACTION_DISPUTE/True/1750950093000', 'C_5nepxgmae/HAS_CASH_CARD_DISPUTE/True/1750941726000', 'C_n0dv2amxq/BLACKLISTED/True/1750950897913', 'C_syddxamj2/HAS_CASH_CHECK_RETURNS/True/1750929692776', 'C_gedhwqmnp/HAS_CASH_CARD_DISPUTE/True/1750944954000', 'C_nhec2qm0p/HAS_ANY_CASH_DISPUTE/True/1750941460000', 'C_nccnp1yte/HAS_CASH_CARD_DISPUTE/True/1750936537000', 'C_fqexdgm00/HAS_CASH_CARD_DISPUTE/True/1750949030000', 'C_5acvdqm99/HAS_CASH_CARD_DID_NOT_MAKE_TRANSACTION_DISPUTE/True/1750947053000', 'C_sqc2rayrk/BLACKLISTED/True/1750941139530', 'C_86c5dhy9e/HAS_CASH_CARD_DISPUTE/True/1750944771000', 'C_fcd8dam9m/BLACKLISTED/True/1750942788505', 'C_5jdvw0mse/HAS_CASH_CARD_DISPUTE/True/1750951461000', 'C_5acvdqm99/HAS_ANY_CASH_DISPUTE/True/1750947053000', 'C_fmcck0ycx/BLACKLISTED/True/1750934193606', 'C_56cwjhmrd/BLACKLISTED/True/1750931219053', 'C_fn2z9py25/HAS_CASH_CHECK_RETURNS/True/1750929695191', 'C_

### ✅ Connection type mismatches

Mismatches between the expected and actual connection types on the change event between the source and target account holders.
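
A mismatch is easier to interpret when it is split into types missing from the actual event and types that appear only in the actual event. A minimal sketch, assuming both sides are plain collections of type names; `connection_type_diff` is a hypothetical helper and the values are illustrative:

```python
from typing import Dict, Iterable, List

def connection_type_diff(expected: Iterable[str], actual: Iterable[str]) -> Dict[str, List[str]]:
    """Compare connection types as sets, ignoring order and duplicates."""
    expected_set, actual_set = set(expected), set(actual)
    return {
        "missing_from_actual": sorted(expected_set - actual_set),
        "extra_in_actual": sorted(actual_set - expected_set),
    }

diff = connection_type_diff(["SSN"], ["SSN", "VERIFIED_SSN"])
print(diff)
# {'missing_from_actual': [], 'extra_in_actual': ['VERIFIED_SSN']}
```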

In [0]:
# Filter rows where connection_type_errors is not empty
connection_type_errors_df = validation_df[validation_df['connection_type_errors'].notna()]

if connection_type_errors_df.empty:
    print("No connection type errors found in the validation results")
else:
    print(f"Found {len(connection_type_errors_df)} label ids with connection type errors:\n{connection_type_errors_df['label_event_id'].to_list()}\n")
    
    for _, row in connection_type_errors_df.iterrows():
        events = row['connection_type_errors'].split('|')
        
        print(f"Label event: {row['label_event_id']}")
        print("Connection type errors:")
        for event in events:
            print(f"  - {event}")
        print()

Found 5 label ids with connection type errors:
['C_5024wgmxs/BLACKLISTED/True/1750946312624', 'C_5he5rayjv/BLACKLISTED/True/1750929122046', 'C_gsd7aam48/BLACKLISTED/True/1750936613176', 'C_fjctr1yv8/BLACKLISTED/True/1750949037678', 'C_fqercqmg3/BLACKLISTED/True/1750952422166']

Label event: C_5024wgmxs/BLACKLISTED/True/1750946312624
Connection type errors:
  - Connection type mismatch for event L/AH_thmh4cdxf/AH_sxmgw4205/BLACKLISTED/1750946312624/+:
  Expected: ['SSN']
  Actual:   ['SSN', 'VERIFIED_SSN']
  - Connection type mismatch for event L/AH_rhmq4zdhn/AH_sxmgw4205/BLACKLISTED/1750946312624/+:
  Expected: ['CASH_DEVICE', 'CASH_OPAQUE_APP_TOKEN', 'SSN']
  Actual:   ['CASH_DEVICE', 'CASH_OPAQUE_APP_TOKEN', 'SSN', 'VERIFIED_SSN']
  - Connection type mismatch for event L/AH_kym09x2h0/AH_sxmgw4205/BLACKLISTED/1750946312624/+:
  Expected: ['SSN']
  Actual:   ['SSN', 'VERIFIED_SSN']
  - Connection type mismatch for event L/AH_82y1en2nh/AH_sxmgw4205/BLACKLISTED/1750946312624/+:
  Expecte

### ✅ Timestamp errors

Account holders with mismatches between expected event timestamps (based on the effective_at of the label event that triggered the change event) and the actual `effective_at_millis` in the generated change events.
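
The comparison depends on converting the label event's `effective_at` to epoch milliseconds. The sketch below uses only the standard library and assumes the timestamp string is in UTC; `to_epoch_millis` and `timestamp_matches` are hypothetical helper names. Integer timedelta division avoids floating-point rounding on the millisecond part.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_epoch_millis(ts: str) -> int:
    """Parse 'YYYY-MM-DD HH:MM:SS[.fff]' as UTC and return epoch milliseconds."""
    fmt = "%Y-%m-%d %H:%M:%S.%f" if "." in ts else "%Y-%m-%d %H:%M:%S"
    dt = datetime.strptime(ts, fmt).replace(tzinfo=timezone.utc)
    return (dt - EPOCH) // timedelta(milliseconds=1)

def timestamp_matches(label_effective_at: str, event_effective_at_millis: int) -> bool:
    """True when the change event carries the label event's effective_at."""
    return to_epoch_millis(label_effective_at) == event_effective_at_millis

print(to_epoch_millis("2025-06-26 12:42:06"))      # 1750941726000
print(to_epoch_millis("2025-06-26 11:27:43.201"))  # 1750937263201
```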

In [0]:
# Filter rows where timestamp_errors is not empty and not NaN
timestamp_errors_df = validation_df[
    validation_df['timestamp_errors'].notna() & (validation_df['timestamp_errors'] != '')
]

if timestamp_errors_df.empty:
    print("No timestamp errors found in the validation results.")
else:
    print(f"Found {len(timestamp_errors_df)} label ids with timestamp errors:\n{timestamp_errors_df['label_event_id'].to_list()}\n")
    for index, row in timestamp_errors_df.iterrows():
        label_id = row['label_event_id']
        # Split the pipe-separated error strings
        error_messages = row['timestamp_errors'].split('|')
        
        print(f"Label event: {label_id}")
        print(f"Number of timestamp errors: {len(error_messages)}")
        print("Timestamp error details:")
        for error_detail in error_messages:
            # Replace literal backslash-n sequences from the CSV with real newlines for readable printing
            readable_error_detail = error_detail.replace('\\n', '\n')
            print(f"  - {readable_error_detail}")
        print("\n") # Add a blank line for separation between labels

No timestamp errors found in the validation results.


## ✅ General output checks

Checks general properties of the output event data set:
- Are there any events that have the same source_user_token and target_user_token?
- Are there duplicate events with the same event_id? (if yes, checks whether their contents are also identical)
- Are there any events where event_time_millis is different to cash_connected_graph_change_event.effective_at_millis?
- Are there any events where cash_connected_graph_change_event.label_change.changed_source_user_label is not in the event_id?
- Are there any events where the source_user_token and target_user_token are not in the event_id?
- Are there any events where the event_type does not match the event_id?
- Are there any events where the effective_at does not match the event_id?
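
For a single event, the per-event checks in this list (all except the duplicate check) can be sketched in plain Python. This is an illustration under stated assumptions, not the Spark implementation: the event is assumed to be a flattened dict, `check_event_against_id` is a hypothetical helper, and the event type name `LABEL_ADDED` and its mapping to the `+` marker are placeholders for whatever the real schema uses.

```python
from typing import Dict, List, Optional

def check_event_against_id(
    event: Dict[str, object],
    marker_by_event_type: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Return a list of consistency problems between an event and its event_id."""
    parts = str(event.get("event_id", "")).split("/")
    if len(parts) != 6 or parts[0] != "L":
        return ["event_id is not of the form L/target/source/label/effective_at_ms/marker"]
    _, target, source, label, effective_at, marker = parts

    problems: List[str] = []
    if event.get("source_user_token") == event.get("target_user_token"):
        problems.append("source and target tokens are identical")
    if event.get("target_user_token") != target:
        problems.append("target_user_token does not match event_id")
    if event.get("source_user_token") != source:
        problems.append("source_user_token does not match event_id")
    if event.get("changed_source_user_label") != label:
        problems.append("changed_source_user_label does not match event_id")
    if str(event.get("effective_at_millis")) != effective_at:
        problems.append("effective_at_millis does not match event_id")
    if event.get("event_time_millis") != event.get("effective_at_millis"):
        problems.append("event_time_millis differs from effective_at_millis")
    # The event_type check only runs when a mapping is supplied (assumed, not from the source).
    if marker_by_event_type is not None:
        if marker_by_event_type.get(str(event.get("event_type"))) != marker:
            problems.append("event_type does not match event_id marker")
    return problems

good_event = {
    "event_id": "L/AH_target/AH_source/BLACKLISTED/1750000000000/+",
    "source_user_token": "AH_source",
    "target_user_token": "AH_target",
    "changed_source_user_label": "BLACKLISTED",
    "effective_at_millis": 1750000000000,
    "event_time_millis": 1750000000000,
    "event_type": "LABEL_ADDED",  # hypothetical event type name
}
print(check_event_against_id(good_event, {"LABEL_ADDED": "+"}))  # []
```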

In [0]:
from pyspark.sql import DataFrame
from pyspark.sql.functions import udf, col, count as spark_count, lit, when
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql import functions as F

def parse_event_id(event_id: str) -> Tuple[str, str, str, str, str]:
    """Parse an event ID of format 'L/target/source/type/effective_at_ms/label_change' into its components.
    Returns (target, source, label_type, effective_at, label_change)."""
    try:
        _, remaining = event_id.split('L/', 1)
        target, source, label_type, effective_at, label_change = remaining.split('/')
        return (target, source, label_type, effective_at, label_change)
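    # ValueError: unexpected number of components; AttributeError: event_id is None
    except (ValueError, AttributeError):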
        return (None, None, None, None, None)
    
# Define the schema for the UDF's return type (a struct)
event_id_schema = StructType([
    StructField("id_target_token", StringType(), True),
    StructField("id_source_token", StringType(), True),
    StructField("id_label_type", StringType(), True),
    StructField("id_effective_at_millis_str", StringType(), True),
    StructField("id_marker", StringType(), True)
])

# Create the Spark UDF
parse_event_id_udf = udf(parse_event_id, event_id_schema)

def format_event(event: dict) -> str: 
    """Format a single CashConnectedGraphChangeEvent LabelChange event in a readable way."""
    # Convert connection_node_types to a sorted tuple for consistent display if it's a list
    conn_types_display = event.get('connection_node_types', [])
    if isinstance(conn_types_display, list):
        conn_types_display = tuple(sorted(conn_types_display))
    
    output = [
        f"Event ID: {event.get('event_id', 'N/A')}",
        f"Source: {event.get('source_user_token', 'N/A')}",
        f"Target: {event.get('target_user_token', 'N/A')}",
        f"Label Type: {event.get('changed_label_type', 'N/A')}",
        f"Effective At: {event.get('effective_at_millis', 'N/A')}",
        f"Published At: {event.get('published_at_millis', 'N/A')}",   
        f"Event Type: {event.get('event_type', 'N/A')}",
        f"Connection Node Types: {conn_types_display}",
        f"Event Time: {event.get('event_time_millis', 'N/A')}"
    ]
    return "\n  ".join(output)

def format_validation_sample(df: pd.DataFrame, validation_type: str, max_samples: int = 3) -> str:
    """Format validation results with samples of problematic events."""
    output = []
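    # Number samples from 1 regardless of the DataFrame's index values
    for sample_num, (_, event) in enumerate(df.head(max_samples).iterrows(), start=1):
        output.append(f"\n{validation_type} {sample_num}:")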
        output.append("  " + format_event(event))
    return "\n".join(output)

In [0]:
def analyze_duplicate_event_ids(spark_df):
    """
    Analyze and report on duplicate event_ids in the Spark DataFrame.
    For each duplicated event_id, efficiently check at Spark scale
    that all business payload fields (with connection_node_types sorted) are identical.
    If mismatches are found, display a sample of actual differing payloads.
    """

    # List of business fields; for connection_node_types, use a sorted version
    business_fields = [
        'source_user_token',
        'target_user_token',
        'event_type',
        'changed_label_type',
        'effective_at_millis',
        'event_time_millis',
        'source_event_node_token',
        'source_event_label',
        'source_event_present',
        'source_event_effective_at',
        'source_event_created_at',
        'source_event_updated_at'
        # 'connection_node_types' will be replaced by the sorted version below
    ]

    print("\n\n2. Event IDs that appear more than once:")

    unique_event_ids_count = spark_df.select("event_id").distinct().count()
    print(f"  Total unique event_ids: {unique_event_ids_count}")

    # Add a sorted version of connection_node_types for canonical comparison
    spark_df_sorted = spark_df.withColumn(
        "conn_types_sorted",
        F.sort_array(F.col("connection_node_types"))
    )

    # Create a struct for business fields including the sorted connection_node_types
    spark_df_struct = spark_df_sorted.withColumn(
        "business_payload",
        F.struct(*([F.col(c) for c in business_fields] + [F.col("conn_types_sorted")]))
    )

    # Group and count duplicates and distinct business payloads per event_id
    dupe_payload_check = (
        spark_df_struct.groupBy("event_id")
        .agg(
            F.count("*").alias("dup_count"),
            F.countDistinct("business_payload").alias("distinct_payload_count")
        )
        .filter(F.col("dup_count") > 1)
        .cache()
    )

    total_dupe_event_ids = dupe_payload_check.count()
    print(f"  Total duplicate event_ids checked: {total_dupe_event_ids}")

    # Find event_ids where payloads are not all identical
    bad_dupes = dupe_payload_check.filter("distinct_payload_count > 1").cache()
    num_bad_dupes = bad_dupes.count()
    print(f"  Number of duplicate event_ids with NON-identical business payloads: {num_bad_dupes}")

    if num_bad_dupes > 0:
        print("\nSample of non-identical duplicates (event_id, dup_count, distinct_payload_count):")
        bad_dupes.select("event_id", "dup_count", "distinct_payload_count").show(10, truncate=False)
        
        # Show actual mismatched payloads for up to 5 problematic event_ids
        bad_ids = [row.event_id for row in bad_dupes.limit(5).collect()]
        if bad_ids:
            print("\nSample mismatched payloads for problematic event_ids:")
            # Select only columns that exist in the flattened query result; the
            # source label event is already covered by the source_event_* fields.
            mismatches = spark_df_struct.filter(F.col("event_id").isin(bad_ids)).select(
                "event_id", *business_fields, "conn_types_sorted", "published_at_millis"
            ).orderBy("event_id", "event_time_millis", "published_at_millis")
            mismatches.show(50, truncate=False)
    else:
        print("All duplicate event_ids are identical in the specified business fields (with connection_node_types compared as sets).")

    # Optionally: show top 10 most common duplicates (by count)
    print("\n  Top 10 event_ids with most duplicates:")
    dupe_payload_check.orderBy(F.col("dup_count").desc()).select("event_id", "dup_count").show(10, truncate=False)

    # Cleanup
    dupe_payload_check.unpersist()
    bad_dupes.unpersist()
    return total_dupe_event_ids, num_bad_dupes

In [0]:
def validate_bau_label_output(ccgce_table: str, spark_session):
    """Validate general properties of BAU label events using Spark for performance.
    
    Args:
        ccgce_table: Full name of the CashConnectedGraphChangeEvent table to validate.
        spark_session: The active SparkSession.
    """
    query = f"""
    SELECT 
        event_time_millis,
        cash_connected_graph_change_event.event_id as event_id,
        cash_connected_graph_change_event.source_user_token as source_user_token,
        cash_connected_graph_change_event.target_user_token as target_user_token,
        cash_connected_graph_change_event.effective_at_millis as effective_at_millis,
        cash_connected_graph_change_event.published_at_millis as published_at_millis,
        cash_connected_graph_change_event.event_type as event_type,
        cash_connected_graph_change_event.label_change.changed_source_user_label as changed_label_type,
        cash_connected_graph_change_event.label_change.connection_node_types as connection_node_types,
        cash_connected_graph_change_event.label_change.source_label_event.node.token as source_event_node_token,
        cash_connected_graph_change_event.label_change.source_label_event.label as source_event_label,
        cash_connected_graph_change_event.label_change.source_label_event.effective_at_msec as source_event_effective_at,
        cash_connected_graph_change_event.label_change.source_label_event.present as source_event_present,
        cash_connected_graph_change_event.label_change.source_label_event.created_at as source_event_created_at,
        cash_connected_graph_change_event.label_change.source_label_event.updated_at as source_event_updated_at
    FROM {ccgce_table}
    WHERE cash_connected_graph_change_event.event_source_type = 'BAU'
    AND cash_connected_graph_change_event.label_change IS NOT NULL
    """
    
    spark_df = spark_session.sql(query)
    spark_df.cache() 
    
    total_events = spark_df.count()
    print(f"Total BAU LabelChange events to validate: {total_events}\n")
    if total_events == 0:
        spark_df.unpersist()
        return

    spark_df_parsed = spark_df.withColumn("id_parts", parse_event_id_udf(col("event_id")))
    spark_df_parsed.cache() # Cache the parsed version as well

    # 1. Events with same source and target user
    self_ref_df_spark = spark_df_parsed.filter(col("source_user_token") == col("target_user_token"))
    self_ref_count = self_ref_df_spark.count()
    print("\n1. Events with same source and target user:")
    print(f"Total count: {self_ref_count}")
    if self_ref_count > 0:
        print("\nSample of such events:")
        print(format_validation_sample(self_ref_df_spark.limit(3).toPandas(), "Self-referential event"))

    # 2. Analyze Duplicate Event IDs
    dup_id_count, dup_mismatched_payload_count = analyze_duplicate_event_ids(spark_df_parsed)

    # 3. Events with event_time_millis != effective_at_millis
    timestamp_mismatch_df_spark = spark_df_parsed.filter(col("event_time_millis") != col("effective_at_millis"))
    timestamp_mismatch_count = timestamp_mismatch_df_spark.count()
    print("\n\n3. Events with event_time_millis != effective_at_millis:")
    print(f"Total count: {timestamp_mismatch_count}")
    if timestamp_mismatch_count > 0:
        print("\nSample of such events:")
        print(format_validation_sample(timestamp_mismatch_df_spark.limit(3).toPandas(), "Mismatched root timestamp"))

    # 4. Events where label_change.changed_source_user_label (payload) is not in the event_id
    label_type_mismatch_df_spark = spark_df_parsed.filter(col("id_parts.id_label_type") != col("changed_label_type"))
    label_type_mismatch_count = label_type_mismatch_df_spark.count()
    print("\n\n4. Events where changed_label_type (payload) != id_label_type (from event_id):") 
    print(f"Total count: {label_type_mismatch_count}")
    if label_type_mismatch_count > 0:
        print("\nSample of such events:") 
        print(format_validation_sample(label_type_mismatch_df_spark.limit(3).toPandas(), "Label type mismatch (payload vs ID)"))

    # 5. Events where source/target user tokens (payload) are not in the event_id (in specific order: target then source in ID)
    token_mismatch_df_spark = spark_df_parsed.filter(
        (col("id_parts.id_target_token") != col("target_user_token"))
        | (col("id_parts.id_source_token") != col("source_user_token"))
    )
    token_mismatch_count = token_mismatch_df_spark.count()
    print("\n\n5. Events where source/target tokens (payload) don't match event_id structure:") 
    print(f"Total count: {token_mismatch_count}")
    if token_mismatch_count > 0:
        print("\nSample of such events:") 
        print(format_validation_sample(token_mismatch_df_spark.limit(3).toPandas(), "Token mismatch (payload vs ID)"))

    # 6. Events where event_type (payload) does not match the marker in event_id
    event_type_vs_marker_mismatch_df_spark = spark_df_parsed.withColumn(
        "expected_marker",
        when(col("event_type") == lit("LABEL_ADDED"), lit("+"))
        .when(col("event_type") == lit("LABEL_REMOVED"), lit("-"))
        .otherwise(lit("UNKNOWN_MARKER_FOR_EVENT_TYPE"))
    ).filter(col("id_parts.id_marker") != col("expected_marker"))
    event_type_vs_marker_mismatch_count = event_type_vs_marker_mismatch_df_spark.count()
    print("\n\n6. Events where event_type (payload) marker doesn't match id_marker (from event_id):") 
    print(f"Total count: {event_type_vs_marker_mismatch_count}")
    if event_type_vs_marker_mismatch_count > 0:
        print("\nSample of such events:") 
        print(format_validation_sample(event_type_vs_marker_mismatch_df_spark.limit(3).toPandas(), "EventType/Marker mismatch"))
        
    # 7. Events where effective_at_millis (payload) does not match effective_at_ms in event_id
    effective_at_mismatch_df_spark = spark_df_parsed.filter(
        col("id_parts.id_effective_at_millis_str") != col("effective_at_millis").cast(StringType())
    )
    effective_at_mismatch_count = effective_at_mismatch_df_spark.count()
    print("\n\n7. Events where effective_at_millis (payload) doesn't match id_effective_at_millis_str (from event_id):") 
    print(f"Total count: {effective_at_mismatch_count}")
    if effective_at_mismatch_count > 0:
        print("\nSample of such events:")
        print(format_validation_sample(effective_at_mismatch_df_spark.limit(3).toPandas(), "EffectiveAt mismatch (payload vs ID)"))

    # --- Summary ---
    print("\n\n--- Overall Sanity Check Summary ---") 
    print(f"Total events validated: {total_events}")
    print(f"1. Self-referential events: {self_ref_count}")
    print(f"2. Event IDs with duplicates: {dup_id_count}. With mismatched payloads: {dup_mismatched_payload_count}")
    print(f"3. Root timestamp mismatches (event_time_millis vs effective_at_millis): {timestamp_mismatch_count}")
    print(f"4. Payload label_type vs. Event ID label_type mismatches: {label_type_mismatch_count}")
    print(f"5. Payload tokens vs. Event ID tokens mismatches: {token_mismatch_count}")
    print(f"6. Payload event_type vs. Event ID marker mismatches: {event_type_vs_marker_mismatch_count}")
    print(f"7. Payload effective_at vs. Event ID effective_at mismatches: {effective_at_mismatch_count}")

    spark_df_parsed.unpersist() # Unpersist the parsed df
    spark_df.unpersist()
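
Checks 4 to 7 above all compare payload fields against components parsed out of `event_id`. The sketch below illustrates the same comparisons on a single event in plain Python. It assumes the ID layout `L/<target>/<source>/<label>/<effective_at_ms>/<marker>`, inferred from the "target then source" comment and the sample IDs in the output; `split_event_id`, `id_mismatches` and the sample tokens are hypothetical names used only for illustration, not the notebook's `parse_event_id_udf`.

```python
from typing import NamedTuple, Optional


class EventIdParts(NamedTuple):
    target_token: str
    source_token: str
    label_type: str
    effective_at_millis: str
    marker: str


def split_event_id(event_id: str) -> Optional[EventIdParts]:
    """Split an ID of the assumed form L/<target>/<source>/<label>/<ms>/<marker>."""
    parts = event_id.split("/")
    if len(parts) != 6 or parts[0] != "L":
        return None
    return EventIdParts(*parts[1:])


# Marker expected for each event_type, mirroring check 6.
EXPECTED_MARKER = {"LABEL_ADDED": "+", "LABEL_REMOVED": "-"}


def id_mismatches(event: dict) -> list:
    """Return the names of the checks (4 to 7) that fail for one event."""
    parts = split_event_id(event["event_id"])
    if parts is None:
        return ["unparseable event_id"]
    problems = []
    if parts.label_type != event["changed_label_type"]:
        problems.append("label_type")
    if (parts.target_token != event["target_user_token"]
            or parts.source_token != event["source_user_token"]):
        problems.append("tokens")
    if parts.marker != EXPECTED_MARKER.get(event["event_type"]):
        problems.append("marker")
    if parts.effective_at_millis != str(event["effective_at_millis"]):
        problems.append("effective_at")
    return problems


# Made-up event for illustration only.
sample = {
    "event_id": "L/AH_target01/AH_source01/BLACKLISTED/1750946683067/+",
    "target_user_token": "AH_target01",
    "source_user_token": "AH_source01",
    "changed_label_type": "BLACKLISTED",
    "event_type": "LABEL_ADDED",
    "effective_at_millis": 1750946683067,
}
print(id_mismatches(sample))  # prints []
```

Unlike the Spark filters, this version reports an unparseable ID explicitly rather than relying on a comparison against missing parts.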

In [0]:
validate_bau_label_output(ccgce_table, spark)

Total BAU LabelChange events to validate: 6743088


1. Events with same source and target user:
Total count: 0


2. Event IDs that appear more than once:
  Total unique event_ids: 4152611
  Total duplicate event_ids checked: 1892141
  Number of duplicate event_ids with NON-identical business payloads: 0
All duplicate event_ids are identical in the specified business fields (with connection_node_types compared as sets).

  Top 10 event_ids with most duplicates:
+------------------------------------------------------------------------------------------+---------+
|event_id                                                                                  |dup_count|
+------------------------------------------------------------------------------------------+---------+
|L/AH_e3ypkswng/AH_ksy04x2x0/HAS_CASH_CARD_AFT_REASON_CODE_DISPUTE/1750976222000/+         |4        |
|L/AH_cfmgjpcag/AH_8hma4neyf/HAS_CASH_CARD_AFT_REASON_CODE_DISPUTE/1750974064000/+         |4        |
|L/AH_shmhkh2qh/AH_n

### Duplicate event_ids in output

A large number of duplicate `event_id`s were observed in edgy's LabelChange output events: 1,892,141 of the 4,152,611 unique `event_id`s (about 46%) appear more than once. Further analysis confirmed that for all such duplicates the contents of the events are identical except for `published_at_millis`.
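
The "identical except for `published_at_millis`" comparison can be illustrated on plain dictionaries. This is a minimal sketch of the idea behind `analyze_duplicate_event_ids`, not a replacement for the Spark version; `canonical_payload`, `has_conflicting_duplicates` and the node type values are hypothetical and chosen only for the example.

```python
def canonical_payload(event: dict, ignored=("published_at_millis",)) -> tuple:
    """Build an order-insensitive, comparable view of an event's payload."""
    items = []
    for key in sorted(event):
        if key in ignored:
            continue
        value = event[key]
        if key == "connection_node_types":
            # Compare connection node types as a set by sorting them first.
            value = tuple(sorted(value))
        items.append((key, value))
    return tuple(items)


def has_conflicting_duplicates(events) -> bool:
    """True if any event_id appears with two different canonical payloads."""
    first_seen = {}
    for event in events:
        payload = canonical_payload(event)
        if first_seen.setdefault(event["event_id"], payload) != payload:
            return True
    return False


# Made-up duplicates: same content, different publish time and list order.
first = {"event_id": "E1", "connection_node_types": ["B", "A"], "published_at_millis": 1}
second = {"event_id": "E1", "connection_node_types": ["A", "B"], "published_at_millis": 2}
print(has_conflicting_duplicates([first, second]))  # prints False
```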

Presidio log analysis showed that duplicate consumption of label events happens consistently, not just in edgy's `LabelKafkaSAMConsumer` but also in other downstream consumers of Duplograph label events (such as Duplograph's signals service), confirming that the duplication originates upstream.

Downstream consumers of these events (i.e. Beacon and Foundry) treat KnowledgeEvents with identical `event_id` and `event_time_millis` as duplicates, so the duplicate events in edgy's output should not cause incorrect downstream processing. No action is therefore required for signal correctness. The duplicate processing does, however, add unnecessary load to the system, so it would be worth addressing upstream in Duplograph.
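
To make the deduplication rule concrete, here is a minimal sketch that keeps the first event seen for each (`event_id`, `event_time_millis`) pair. It only illustrates the rule described above under that assumption; it is not the Beacon or Foundry implementation, and `drop_duplicates` and the sample events are hypothetical.

```python
def drop_duplicates(events):
    """Keep the first event for each (event_id, event_time_millis) key."""
    seen = set()
    unique = []
    for event in events:
        key = (event["event_id"], event["event_time_millis"])
        if key in seen:
            continue  # Same ID and event time: treated as a duplicate.
        seen.add(key)
        unique.append(event)
    return unique


# Made-up events: the first two differ only in published_at_millis.
events = [
    {"event_id": "E1", "event_time_millis": 100, "published_at_millis": 10},
    {"event_id": "E1", "event_time_millis": 100, "published_at_millis": 11},
    {"event_id": "E2", "event_time_millis": 100, "published_at_millis": 12},
]
print(len(drop_duplicates(events)))  # prints 2
```

Because `published_at_millis` is not part of the key, re-published copies of the same event collapse into one.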