# 02 - Split Test Layers

**Objective:** Divide the transaction codes into three distinct cohorts for layered testing, from easiest to hardest.

### The 3 Layers:
1. **Layer 1 (Obvious):** Codes with a single, unambiguous mapping in the ground truth.
2. **Layer 2 (Ambiguous):** Codes with multiple possible mappings in the ground truth that require description context to distinguish.
3. **Layer 3 (Unknown):** Codes present in raw data but missing from the original Master Fee Table (e.g., RTP, FedNow, Wires).
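
The layer rules above can be sketched on a toy table. This is a minimal illustration, not the notebook's implementation: the codes and category names are made up, `assign_layer` is a hypothetical helper, and it counts distinct `L1` values per code, whereas the cells below count ground-truth rows.

```python
import pandas as pd

# Hypothetical ground-truth rows; codes and categories are invented for illustration.
toy_gt = pd.DataFrame({
    "TRANCD": ["100", "200", "200", "300"],
    "L1": ["Deposits", "Fees", "Transfers", ""],
})

def assign_layer(l1_values: pd.Series) -> int:
    """Return 1, 2 or 3 for a single code, given all of its L1 values."""
    mapped = l1_values[l1_values != ""]
    if mapped.empty:
        return 3  # no mapping at all: unknown
    if mapped.nunique() > 1:
        return 2  # several distinct mappings: ambiguous
    return 1      # exactly one mapping: obvious

toy_layers = toy_gt.groupby("TRANCD")["L1"].apply(assign_layer)
print(toy_layers.to_dict())
```

Here code `100` lands in Layer 1, `200` in Layer 2, and `300` in Layer 3.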

In [0]:
import pandas as pd
import numpy as np
import os

# Paths relative to project root
GT_PATH = "../taxonomy/data/ground_truth_normalized.csv"
CATALOG_PATH = "../taxonomy/data/transaction_code_catalog.csv"
OUTPUT_DIR = "../taxonomy/data/"

print("Loading datasets...")
# fillna('') turns the missing L1/L2 values of unknown codes into empty strings
df_gt = pd.read_csv(GT_PATH, dtype=str).fillna('')
df_catalog = pd.read_csv(CATALOG_PATH, dtype=str).fillna('')

print(f"Ground truth rows: {len(df_gt)}")
print(f"Catalog rows: {len(df_catalog)}")

In [0]:
# 1. Define Layer 3 (Unknown)
# These are codes where the Level 1 category is empty in the ground truth file
l3_codes = df_gt[df_gt['L1'] == '']['TRANCD'].unique()
df_l3 = df_catalog[df_catalog['TRANCD'].isin(l3_codes)].copy()

print(f"Layer 3 (Unknown): {len(l3_codes)} codes, {len(df_l3)} catalog rows.")
print(f"Layer 3 codes: {list(l3_codes)}")

In [0]:
# 2. Define Layer 2 (Ambiguous)
# These are codes that have MORE THAN ONE entry in the mapped ground truth
code_counts = df_gt[df_gt['L1'] != '']['TRANCD'].value_counts()
ambiguous_codes = code_counts[code_counts > 1].index.tolist()

df_l2 = df_catalog[df_catalog['TRANCD'].isin(ambiguous_codes)].copy()

print(f"Layer 2 (Ambiguous): {len(ambiguous_codes)} codes, {len(df_l2)} catalog rows.")
print(f"Layer 2 codes: {ambiguous_codes}")
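
Counting rows with `value_counts()` treats a code as ambiguous whenever it appears more than once, even if every row carries the same mapping. The sketch below, on made-up data, compares that row count with a count of distinct `L1` values. Whether the real ground truth contains such duplicate rows is not known here, so this is only a check one might run.

```python
import pandas as pd

# Hypothetical data: code 400 is listed twice with the same mapping.
toy_gt = pd.DataFrame({
    "TRANCD": ["200", "200", "400", "400"],
    "L1": ["Fees", "Transfers", "Fees", "Fees"],
})

mapped = toy_gt[toy_gt["L1"] != ""]
row_counts = mapped["TRANCD"].value_counts()
distinct_counts = mapped.groupby("TRANCD")["L1"].nunique()

# Codes flagged by each rule, sorted for stable comparison.
by_rows = sorted(row_counts[row_counts > 1].index)
by_distinct = sorted(distinct_counts[distinct_counts > 1].index)

print(by_rows)      # ['200', '400']
print(by_distinct)  # ['200']
```

If the two lists differ on the real data, the extra codes are repeated rows rather than competing mappings.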

In [0]:
# 3. Define Layer 1 (Obvious)
# These are codes that have EXACTLY ONE mapping in GT and L1 is not empty
obvious_codes = code_counts[code_counts == 1].index.tolist()
df_l1 = df_catalog[df_catalog['TRANCD'].isin(obvious_codes)].copy()

print(f"Layer 1 (Obvious): {len(obvious_codes)} codes, {len(df_l1)} catalog rows.")
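
Because Layer 3 is selected from rows with an empty `L1` while Layers 1 and 2 are selected from rows with a non-empty `L1`, a code that has both kinds of rows would land in two layers. The helper below is a hypothetical sanity check (`summarize_layer_overlap` is not part of the notebook) that reports overlaps and catalog codes assigned to no layer.

```python
def summarize_layer_overlap(obvious, ambiguous, unknown, catalog_codes):
    """Report codes shared between layers and catalog codes in no layer."""
    o, a, u = set(obvious), set(ambiguous), set(unknown)
    return {
        "obvious_and_ambiguous": sorted(o & a),
        "obvious_and_unknown": sorted(o & u),
        "ambiguous_and_unknown": sorted(a & u),
        "unassigned": sorted(set(catalog_codes) - (o | a | u)),
    }

# Toy inputs: code 300 sits in two layers and code 900 is in none.
report = summarize_layer_overlap(
    obvious=["100", "300"],
    ambiguous=["200"],
    unknown=["300"],
    catalog_codes=["100", "200", "300", "900"],
)
print(report)
```

In the notebook this could be called with `obvious_codes`, `ambiguous_codes`, `l3_codes`, and `df_catalog['TRANCD']`. Every list in the result being empty would indicate a clean three-way split.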

In [0]:
# 4. Save test sets
df_l1.to_csv(os.path.join(OUTPUT_DIR, "layer_1_test_set.csv"), index=False)
df_l2.to_csv(os.path.join(OUTPUT_DIR, "layer_2_test_set.csv"), index=False)
df_l3.to_csv(os.path.join(OUTPUT_DIR, "layer_3_test_set.csv"), index=False)

print("Test layers saved successfully:")
print(f"- Layer 1: {os.path.join(OUTPUT_DIR, 'layer_1_test_set.csv')}")
print(f"- Layer 2: {os.path.join(OUTPUT_DIR, 'layer_2_test_set.csv')}")
print(f"- Layer 3: {os.path.join(OUTPUT_DIR, 'layer_3_test_set.csv')}")
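
When the saved test sets are read back, passing `dtype=str` (as the loading cell does) keeps codes with leading zeros intact. A small sketch with an in-memory CSV, where the `DESC` column and the code `007` are hypothetical:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a saved layer file.
csv_text = "TRANCD,DESC\n007,Wire fee\n"

as_default = pd.read_csv(io.StringIO(csv_text))         # TRANCD parsed as an integer
as_str = pd.read_csv(io.StringIO(csv_text), dtype=str)  # TRANCD kept as text

print(as_default["TRANCD"].iloc[0])  # 7
print(as_str["TRANCD"].iloc[0])      # 007
```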