# 02 - Split Test Layers

**Objective:** Divide the transaction codes into three distinct cohorts for layered testing, from easiest to hardest.

### The 3 Layers:
1. **Layer 1 (Obvious):** Codes with a single, unambiguous mapping in the ground truth (~40 codes).
2. **Layer 2 (Ambiguous):** Codes with multiple possible mappings in the ground truth that require description context to distinguish (~11 codes).
3. **Layer 3 (Unknown):** Codes present in raw data but missing from ground truth (7 codes: 34, 42, 44, 46, 66, 67, 83).
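
The three definitions above amount to a set partition of the catalog codes. The following is a minimal sketch of that logic on invented toy codes; `split_layers` is a hypothetical helper and is not part of the notebook's pipeline.

```python
def split_layers(catalog_codes, gt_codes, ambiguous_codes):
    """Partition catalog codes into the three layers (hypothetical helper)."""
    catalog = set(catalog_codes)
    known = catalog & set(gt_codes)           # codes covered by ground truth
    layer_2 = known & set(ambiguous_codes)    # known, but with multiple mappings
    layer_1 = known - layer_2                 # known, with a single mapping
    layer_3 = catalog - set(gt_codes)         # in the catalog, absent from ground truth
    return layer_1, layer_2, layer_3


# Toy codes, invented for illustration only
l1, l2, l3 = split_layers(
    catalog_codes=["10", "15", "20", "34"],
    gt_codes=["10", "15", "20"],
    ambiguous_codes=["15"],
)
print(sorted(l1), sorted(l2), sorted(l3))
```

Because Layer 1 is defined as "known minus ambiguous", every catalog code lands in exactly one layer.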

In [None]:
import pandas as pd
import numpy as np
import os

# Paths
gt_path = "../taxonomy/data/ground_truth_normalized.csv"
catalog_path = "../taxonomy/data/transaction_code_catalog.csv"

output_dir = "../taxonomy/data/"
l1_path = os.path.join(output_dir, "layer_1_test_set.csv")
l2_path = os.path.join(output_dir, "layer_2_test_set.csv")
l3_path = os.path.join(output_dir, "layer_3_test_set.csv")

print("Loading data...")
df_gt = pd.read_csv(gt_path, dtype={'TRANCD': str})
df_catalog = pd.read_csv(catalog_path, dtype={'TRANCD': str})

print(f"Ground truth rows: {len(df_gt)}")
print(f"Catalog rows: {len(df_catalog)}")
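
The `dtype={'TRANCD': str}` argument matters because the later cells compare codes against string literals such as `'15'`. The sketch below shows the difference on an invented two-row CSV; the `DESCRIPTION` column and the leading-zero code are assumptions made for illustration and may not occur in the real files.

```python
import io

import pandas as pd

# Toy CSV, invented for illustration; the real files are not reproduced here
csv_text = "TRANCD,DESCRIPTION\n015,Sample fee\n49,Sample charge\n"

# Without a dtype hint, pandas infers an integer column and drops the leading zero
as_default = pd.read_csv(io.StringIO(csv_text))

# With the hint, codes stay as the exact strings found in the file
as_string = pd.read_csv(io.StringIO(csv_text), dtype={"TRANCD": str})

print(as_default["TRANCD"].tolist())
print(as_string["TRANCD"].tolist())
```

Integer codes would generally not match a list of string codes passed to `isin`, so reading both files with the same string dtype keeps the comparisons consistent.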

In [None]:
# 1. Define Layer 3 (Unknown)
# These are codes in catalog but NOT in ground truth
l3_codes = df_catalog[~df_catalog['TRANCD'].isin(df_gt['TRANCD'])]['TRANCD'].unique()
df_l3 = df_catalog[df_catalog['TRANCD'].isin(l3_codes)].copy()

print(f"Layer 3 (Unknown): {len(l3_codes)} codes found.")
print(f"Layer 3 codes: {list(l3_codes)}")
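
The overview lists seven expected Layer 3 codes, so the computed list can be compared against it. This is a minimal sketch with a hypothetical helper, `compare_codes`, run on an invented "found" list; in the notebook the first argument would be `l3_codes`.

```python
# Expected Layer 3 codes, as listed in the overview at the top of this notebook
expected_l3 = {"34", "42", "44", "46", "66", "67", "83"}


def compare_codes(found, expected):
    """Return (missing, unexpected) code lists relative to the expected set."""
    found, expected = set(found), set(expected)
    return sorted(expected - found), sorted(found - expected)


# Toy 'found' list for illustration: '83' is absent and '99' is extra
missing, unexpected = compare_codes(
    ["34", "42", "44", "46", "66", "67", "99"], expected_l3
)
print(f"Missing: {missing}, unexpected: {unexpected}")
```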

In [None]:
# 2. Define Layer 2 (Ambiguous)
# Step 0 dropped duplicates from ground_truth_normalized.csv, so multi-mapping
# codes can no longer be detected from that file. Identifying them would require
# re-examining the original Master Fee Table; here we use the known list of
# 11 multi-mapping codes instead.

ambiguous_codes = [
    '15', '49', '62', '110', '145', '150', '155', '240', '641', '642', '646'
]

# Keep only the ambiguous codes that also appear in the ground truth
in_ground_truth = df_catalog['TRANCD'].isin(df_gt['TRANCD'])
is_ambiguous = df_catalog['TRANCD'].isin(ambiguous_codes)
df_l2 = df_catalog[is_ambiguous & in_ground_truth].copy()

print(f"Layer 2 (Ambiguous): {df_l2['TRANCD'].nunique()} codes identified.")
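
If a version of the ground truth with one row per (code, mapping) pair is available, the ambiguous list could be derived rather than hard-coded. The sketch below assumes such a table and uses `CATEGORY` as a hypothetical name for the mapping column; the rows are invented and do not reflect the actual Master Fee Table.

```python
import pandas as pd

# Toy ground truth that still has one row per (code, mapping) pair.
# 'CATEGORY' is a hypothetical column name for the mapping target.
df_raw_gt = pd.DataFrame({
    "TRANCD": ["15", "15", "20", "49", "49", "49"],
    "CATEGORY": ["Fee A", "Fee B", "Fee C", "Fee D", "Fee D", "Fee E"],
})

# Count distinct mappings per code; more than one means the code is ambiguous
mappings_per_code = df_raw_gt.groupby("TRANCD")["CATEGORY"].nunique()
derived_ambiguous = sorted(mappings_per_code[mappings_per_code > 1].index)
print(derived_ambiguous)
```

Using `nunique` rather than a row count means repeated identical mappings (such as the two `Fee D` rows) do not make a code look ambiguous.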

In [None]:
# 3. Define Layer 1 (Obvious)
# Codes in ground truth that are NOT in Layer 2
in_ground_truth = df_catalog['TRANCD'].isin(df_gt['TRANCD'])
is_ambiguous = df_catalog['TRANCD'].isin(ambiguous_codes)
df_l1 = df_catalog[in_ground_truth & ~is_ambiguous].copy()

print(f"Layer 1 (Obvious): {df_l1['TRANCD'].nunique()} codes identified.")
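
Before saving, it may be worth confirming that the three layers do not overlap and together cover every catalog code. A minimal sketch with a hypothetical helper, `check_partition`, on invented codes; in the notebook the arguments would be the `TRANCD` columns of `df_l1`, `df_l2`, `df_l3`, and `df_catalog`.

```python
def check_partition(layer_1, layer_2, layer_3, all_codes):
    """Return (overlaps, uncovered) code lists for three candidate layers."""
    l1, l2, l3 = set(layer_1), set(layer_2), set(layer_3)
    overlaps = (l1 & l2) | (l1 & l3) | (l2 & l3)   # codes assigned to 2+ layers
    uncovered = set(all_codes) - (l1 | l2 | l3)    # codes assigned to no layer
    return sorted(overlaps), sorted(uncovered)


# Toy input for illustration: code '42' is deliberately left out of every layer
overlaps, uncovered = check_partition(
    ["10", "20"], ["15"], ["34"], ["10", "15", "20", "34", "42"]
)
print(f"Overlaps: {overlaps}, uncovered: {uncovered}")
```

Both lists being empty would indicate a clean partition.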

In [None]:
# 4. Save test sets
df_l1.to_csv(l1_path, index=False)
df_l2.to_csv(l2_path, index=False)
df_l3.to_csv(l3_path, index=False)

print("Test layers saved successfully:")
print(f"- Layer 1: {l1_path}")
print(f"- Layer 2: {l2_path}")
print(f"- Layer 3: {l3_path}")