# 05 — Evaluate Transaction Categorization Accuracy

**Objective:** Compare LLM predictions from `04_transaction_categorization_test` against the Master Fee Table ground truth.

### Metrics
1. **Per-Level Accuracy** — L1, L2, L3, L4 independently (case-insensitive).
2. **Exact Match** — All 4 levels correct.
3. **Partial Match** — L1 + L2 correct (right block and category).
4. **Volume-Weighted** — Weighted by transaction count (code 183 at 99K txns matters more than code 110 at 4 txns).
5. **Per-Layer** — Obvious (single GT mapping) vs Ambiguous (multi GT mapping) vs Unknown (no GT).
6. **Failure Analysis** — Root cause patterns and prompt improvement recommendations.
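
To make the difference between the unweighted and volume-weighted metrics concrete, here is a minimal sketch using the two codes mentioned above. The volumes are round stand-ins (99,000 for "99K") and the match outcomes are invented purely for illustration; they are not results from this run.

```python
# Illustrative rows only: volumes are stand-ins, match outcomes are made up.
rows = [
    {"TRANCD": "183", "volume": 99_000, "exact_match": True},
    {"TRANCD": "110", "volume": 4, "exact_match": False},
]

# Unweighted: every code counts once.
unweighted = sum(r["exact_match"] for r in rows) / len(rows)

# Volume-weighted: every transaction counts once.
total_volume = sum(r["volume"] for r in rows)
weighted = sum(r["volume"] for r in rows if r["exact_match"]) / total_volume

print(f"Unweighted accuracy:      {unweighted:.1%}")
print(f"Volume-weighted accuracy: {weighted:.3%}")
```

In this toy case one of two codes is wrong, so the unweighted score is 50%, yet almost every transaction is categorized correctly. That gap is why the target in the summary report is stated on the volume-weighted figure.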

### Key Design Decisions
- **Case-insensitive** — GT has `Debit card`, LLM outputs `Debit Card`. Both correct.
- **Null-safe** — `None`, `NaN`, `null`, `N/A` all treated as equivalent.
- **Multi-mapping** — 12 codes have 2+ valid GT mappings. LLM is correct if it matches ANY.
- **GT normalization** — Fix casing inconsistencies (`Fee Item` → `Fee item`, `NSF /OD` → `NSF/OD`).

In [None]:
import pandas as pd
import numpy as np
import json
import os

# ===========================================================================
# CONFIGURATION — Update these paths for your environment
# ===========================================================================
RESULTS_PATH = "../results/04_transaction_categorization_test.csv"
GT_PATH = "../taxonomy/data/Master_Fee_Table_Master_.csv"
OUTPUT_DIR = "../results"

os.makedirs(OUTPUT_DIR, exist_ok=True)
print("Configuration ready.")

---
## 1. Load & Parse LLM Results

In [None]:
df_raw = pd.read_csv(RESULTS_PATH, dtype={"TRANCD": str})
print(f"Loaded {len(df_raw)} predictions.")

# Parse the JSON string in the 'parsed' column into separate columns
parsed = df_raw["parsed"].apply(json.loads).apply(pd.Series)

df_res = pd.concat(
    [df_raw[["TRANCD", "sample_desc_1", "volume", "source_file"]], parsed], axis=1
)

# Prefix LLM columns so they're unambiguous after the merge
df_res = df_res.rename(
    columns={
        "category_1": "llm_L1",
        "category_2": "llm_L2",
        "category_3": "llm_L3",
        "category_4": "llm_L4",
        "include_in_scoring": "llm_scoring",
        "credit_debit": "llm_credit_debit",
        "confidence": "llm_confidence",
    }
)

print(f"\nLLM value distribution:")
for col in ["llm_L1", "llm_L2", "llm_L3", "llm_L4"]:
    vals = sorted(df_res[col].dropna().unique())
    print(f"  {col}: {vals}")

df_res.head(3)

---
## 2. Load & Normalize Ground Truth

In [None]:
# Read every column as text so a code such as "183" is never parsed as 183.0
df_gt_raw = pd.read_csv(GT_PATH, encoding="latin-1", dtype=str)
df_gt_raw.columns = [c.strip() for c in df_gt_raw.columns]

# Drop rows that have no transaction code (header leaks, blanks)
df_gt = df_gt_raw[
    df_gt_raw["External Transaction Code"].notna()
    & (df_gt_raw["External Transaction Code"].astype(str).str.strip() != "")
].copy()

df_gt["TRANCD"]         = df_gt["External Transaction Code"].astype(str).str.strip()
df_gt["gt_desc"]        = df_gt["External Transaction Description"].str.strip()
df_gt["gt_L1"]          = df_gt["Scoring Category 1"].str.strip()
df_gt["gt_L2"]          = df_gt["Scoring Category 2"].str.strip()
df_gt["gt_L3"]          = df_gt["Scoring Category 3"].str.strip()
df_gt["gt_L4"]          = df_gt["Scoring Category 4"].str.strip()
df_gt["gt_credit_debit"] = df_gt["Credit / Debit"].str.strip()

print(f"Raw GT: {len(df_gt)} rows, {df_gt['TRANCD'].nunique()} unique codes")
print(f"\nBEFORE normalization:")
print(f"  L1 values: {sorted(df_gt['gt_L1'].dropna().unique())}")
print(f"  L2 values: {sorted(df_gt['gt_L2'].dropna().unique())}")

In [None]:
# ── Normalization maps ────────────────────────────────────────────
#  Fix casing inconsistencies and leaked header rows.

L1_NORM = {
    "Fee Item":            "Fee item",
    "Fee item":            "Fee item",
    "Non-fee item":        "Non-fee item",
    "Scoring Category 1":  None,           # header row leak
}

L2_NORM = {
    "NSF /OD":             "NSF/OD",
    "NSF/OD":              "NSF/OD",
    "Money Movement":      "Money movement",
    "Money movement":      "Money movement",
    "Account Operations":  "Account operations",
    "Account operations":  "Account operations",
    "All others":          "All others",
    "Service Charges":     "Service Charges",
    "Scoring Category 2":  None,           # header row leak
}

L3_NORM = {"N/A": None, "Money Movement": "Money movement", "Account Operations": "Account operations"}
L4_NORM = {"N/A": None}

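def _normalize(series, mapping):
    # Unmapped values pass through unchanged; values mapped to None stay null.
    # A .map(mapping).fillna(original) chain would restore the nulled values,
    # so header-leak rows and "N/A" would never actually be cleared.
    return series.map(lambda v: mapping.get(v, v))

df_gt["gt_L1"] = _normalize(df_gt["gt_L1"], L1_NORM)
df_gt["gt_L2"] = _normalize(df_gt["gt_L2"], L2_NORM)
df_gt["gt_L3"] = _normalize(df_gt["gt_L3"], L3_NORM)
df_gt["gt_L4"] = _normalize(df_gt["gt_L4"], L4_NORM)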

# Drop header-leak rows (L1 mapped to None)
df_gt = df_gt[df_gt["gt_L1"].notna()].copy()

print(f"After normalization: {len(df_gt)} rows, {df_gt['TRANCD'].nunique()} unique codes")
print(f"\nAFTER normalization:")
print(f"  L1 values: {sorted(df_gt['gt_L1'].dropna().unique())}")
print(f"  L2 values: {sorted(df_gt['gt_L2'].dropna().unique())}")
print(f"  L3 values: {sorted(df_gt['gt_L3'].dropna().unique())}")

---
## 3. Assign Test Layers

In [None]:
# Identify multi-mapping codes (same TRANCD → 2+ distinct categorizations)
gt_mapping_counts = (
    df_gt.drop_duplicates(subset=["TRANCD", "gt_L1", "gt_L2", "gt_L3", "gt_L4"])
    .groupby("TRANCD")
    .size()
    .reset_index(name="n_mappings")
)

multi_codes  = set(gt_mapping_counts.loc[gt_mapping_counts["n_mappings"] > 1, "TRANCD"])
single_codes = set(gt_mapping_counts.loc[gt_mapping_counts["n_mappings"] == 1, "TRANCD"])
all_gt_codes = set(df_gt["TRANCD"].unique())

def assign_layer(trancd):
    if trancd not in all_gt_codes:
        return "Layer 3: Unknown"
    if trancd in multi_codes:
        return "Layer 2: Ambiguous"
    return "Layer 1: Obvious"

df_res["test_layer"] = df_res["TRANCD"].apply(assign_layer)

layer_summary = (
    df_res.groupby("test_layer")
    .agg(codes=("TRANCD", "nunique"), total_volume=("volume", "sum"))
    .reset_index()
)
layer_summary["pct_volume"] = (
    layer_summary["total_volume"] / layer_summary["total_volume"].sum() * 100
).round(1)
print("Test-layer distribution:")
print(layer_summary.to_string(index=False))

---
## 4. Comparison Helpers

In [None]:
# ── Case-insensitive, null-safe comparison ────────────────────────

_NULL = "__null__"

def _canon(val):
    """Canonicalize a value for comparison: lowercase, strip, null-safe."""
    if val is None or (isinstance(val, float) and np.isnan(val)):
        return _NULL
    s = str(val).strip().lower()
    if s in ("", "none", "nan", "null", "n/a"):
        return _NULL
    return s


def levels_match(row, llm_col, gt_col):
    return _canon(row[llm_col]) == _canon(row[gt_col])


def add_match_columns(df):
    """Add per-level and aggregate match booleans."""
    df = df.copy()
    df["match_L1"] = df.apply(levels_match, axis=1, llm_col="llm_L1", gt_col="gt_L1")
    df["match_L2"] = df.apply(levels_match, axis=1, llm_col="llm_L2", gt_col="gt_L2")
    df["match_L3"] = df.apply(levels_match, axis=1, llm_col="llm_L3", gt_col="gt_L3")
    df["match_L4"] = df.apply(levels_match, axis=1, llm_col="llm_L4", gt_col="gt_L4")

    df["exact_match"]        = df[["match_L1", "match_L2", "match_L3", "match_L4"]].all(axis=1)
    df["partial_match_L1L2"] = df[["match_L1", "match_L2"]].all(axis=1)
    df["partial_match_L1L2L3"] = df[["match_L1", "match_L2", "match_L3"]].all(axis=1)
    return df


print("Helpers ready.")

---
## 5. Layer 1 — Obvious Codes (Single Mapping)

In [None]:
# Build GT lookup for single-mapping codes (1 row per TRANCD)
gt_cols = ["TRANCD", "gt_desc", "gt_L1", "gt_L2", "gt_L3", "gt_L4", "gt_credit_debit"]
df_gt_single = (
    df_gt[df_gt["TRANCD"].isin(single_codes)]
    .drop_duplicates(subset="TRANCD", keep="first")[gt_cols]
)

# Merge
df_l1 = pd.merge(
    df_res[df_res["test_layer"] == "Layer 1: Obvious"],
    df_gt_single,
    on="TRANCD",
    how="left",
)

df_l1 = add_match_columns(df_l1)
n = len(df_l1)

print("=" * 65)
print(f"LAYER 1 — OBVIOUS CODES  (n = {n})")
print("=" * 65)
print(f"  L1 (Fee vs Non-fee):      {df_l1['match_L1'].mean():.1%}")
print(f"  L2 (Category):            {df_l1['match_L2'].mean():.1%}")
print(f"  L3 (Channel):             {df_l1['match_L3'].mean():.1%}")
print(f"  L4 (Subtype):             {df_l1['match_L4'].mean():.1%}")
print(f"  ─────────────────────────────")
print(f"  Partial (L1+L2):          {df_l1['partial_match_L1L2'].mean():.1%}")
print(f"  Partial (L1+L2+L3):       {df_l1['partial_match_L1L2L3'].mean():.1%}")
print(f"  Exact Match (all 4):      {df_l1['exact_match'].mean():.1%}")

In [None]:
# ── Volume-weighted accuracy ──────────────────────────────────────
vol = df_l1["volume"].sum()

print("=" * 65)
print(f"VOLUME-WEIGHTED ACCURACY  (total: {vol:,} transactions)")
print("=" * 65)

for col, label in [
    ("match_L1",            "L1 (Fee vs Non-fee)"),
    ("match_L2",            "L2 (Category)"),
    ("match_L3",            "L3 (Channel)"),
    ("match_L4",            "L4 (Subtype)"),
    ("partial_match_L1L2",  "Partial (L1+L2)"),
    ("exact_match",         "Exact (all 4)"),
]:
    w = (df_l1[col] * df_l1["volume"]).sum() / vol
    print(f"  {label:<25} {w:.1%}")

---
## 6. Layer 2 — Ambiguous Codes (Multi-Mapping)

A prediction is correct if it matches **any** of the valid GT mappings for that code.
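
The any-match rule can also be expressed as set membership, which may be easier to reason about than the nested loop. The sketch below is self-contained and uses a hypothetical code `999` with two invented mappings; `canon`, `path`, `valid_paths`, and `matches_any` are illustrative names, not part of the notebook.

```python
def canon(val):
    """Lowercase, strip, and collapse null-like values to one sentinel."""
    if val is None or val != val:  # val != val is True only for NaN
        return "__null__"
    s = str(val).strip().lower()
    return "__null__" if s in ("", "none", "nan", "null", "n/a") else s


def path(levels):
    """Canonical 4-tuple for an (L1, L2, L3, L4) categorization."""
    return tuple(canon(v) for v in levels)


# Hypothetical ground truth: one code with two valid mappings.
valid_paths = {
    "999": {
        path(["Non-fee item", "Money movement", "ACH", "N/A"]),
        path(["Non-fee item", "Money movement", "Wire", None]),
    }
}


def matches_any(trancd, llm_levels):
    """True if the LLM path equals any valid GT path for this code."""
    return path(llm_levels) in valid_paths.get(trancd, set())


print(matches_any("999", ["non-fee item", "Money Movement", "wire", float("nan")]))
print(matches_any("999", ["Non-fee item", "Money movement", "Check", None]))
```

Because both sides are canonicalized before comparison, casing differences and the various null spellings do not affect the outcome.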

In [None]:
df_l2_src = df_res[df_res["test_layer"] == "Layer 2: Ambiguous"].copy()

rows = []
for _, row in df_l2_src.iterrows():
    code = row["TRANCD"]
    gt_maps = (
        df_gt[df_gt["TRANCD"] == code][["gt_L1", "gt_L2", "gt_L3", "gt_L4"]]
        .drop_duplicates()
    )
    llm = tuple(_canon(row[c]) for c in ["llm_L1", "llm_L2", "llm_L3", "llm_L4"])

    best_levels = 0
    matched_any = False
    for _, g in gt_maps.iterrows():
        gt = tuple(_canon(g[c]) for c in ["gt_L1", "gt_L2", "gt_L3", "gt_L4"])
        n_match = sum(a == b for a, b in zip(llm, gt))
        best_levels = max(best_levels, n_match)
        if llm == gt:
            matched_any = True

    rows.append(
        {
            "TRANCD": code,
            "description": row["sample_desc_1"],
            "volume": row["volume"],
            "llm_path": f"{row['llm_L1']} > {row['llm_L2']} > {row['llm_L3']}",
            "exact_match_any": matched_any,
            "best_levels_matched": best_levels,
            "n_valid_mappings": len(gt_maps),
            "confidence": row["llm_confidence"],
        }
    )

df_l2 = pd.DataFrame(rows)

print("=" * 65)
print(f"LAYER 2 — AMBIGUOUS CODES  (n = {len(df_l2)})")
print("=" * 65)
print(f"  Exact match (any valid mapping):  {df_l2['exact_match_any'].mean():.1%}")
print(f"  Avg best levels matched:          {df_l2['best_levels_matched'].mean():.1f} / 4")
print()
print(df_l2.to_string(index=False))

In [None]:
# Show all valid GT mappings for each ambiguous code
print("\nValid GT mappings for ambiguous codes in our results:")
for code in sorted(df_l2["TRANCD"].unique()):
    maps = (
        df_gt[df_gt["TRANCD"] == code][["gt_desc", "gt_L1", "gt_L2", "gt_L3"]]
        .drop_duplicates()
    )
    llm_row = df_l2_src[df_l2_src["TRANCD"] == code].iloc[0]
    print(f"\n  TRANCD={code}  (LLM: {llm_row['llm_L1']} > {llm_row['llm_L2']} > {llm_row['llm_L3']})")
    for _, m in maps.iterrows():
        print(f"    GT: {str(m['gt_L1']):<15} > {str(m['gt_L2']):<22} > {str(m['gt_L3']):<25} | {str(m['gt_desc'])[:45]}")

---
## 7. Layer 3 — Unknown Codes (No Ground Truth)

In [None]:
df_l3 = df_res[df_res["test_layer"] == "Layer 3: Unknown"].copy()

print("=" * 65)
print(f"LAYER 3 — UNKNOWN CODES  (n = {len(df_l3)}, no ground truth)")
print("=" * 65)
print("These need manual review by Sid / Mike.\n")

for _, r in df_l3.sort_values("volume", ascending=False).iterrows():
    print(f"  TRANCD={r['TRANCD']:>5} | vol={r['volume']:>6,} | conf={r['llm_confidence']}")
    print(f"    desc: {r['sample_desc_1']}")
    print(f"    LLM:  {r['llm_L1']} > {r['llm_L2']} > {r['llm_L3']} > {r['llm_L4']}")
    print(f"    scoring: {r['llm_scoring']}")
    print()

---
## 8. Failure Analysis — Layer 1

In [None]:
failures = df_l1[~df_l1["exact_match"]].copy()

def failure_type(row):
    if not row["match_L1"]:
        return "WRONG BLOCK (L1)"
    if not row["match_L2"]:
        return "WRONG CATEGORY (L2)"
    if not row["match_L3"]:
        return "WRONG CHANNEL (L3)"
    if not row["match_L4"]:
        return "WRONG SUBTYPE (L4)"
    return "UNKNOWN"

failures["failure_type"] = failures.apply(failure_type, axis=1)

print("=" * 65)
print(f"FAILURE ANALYSIS — {len(failures)} mismatches out of {len(df_l1)} obvious codes")
print("=" * 65)

ft = (
    failures.groupby("failure_type")
    .agg(
        count=("TRANCD", "count"),
        volume=("volume", "sum"),
        examples=("TRANCD", lambda x: ", ".join(x.head(4))),
    )
    .sort_values("count", ascending=False)
    .reset_index()
)

print("\nFailure type distribution:")
print(ft.to_string(index=False))

In [None]:
# ── Detailed mismatch table ───────────────────────────────────────

print("\n" + "=" * 65)
print("DETAILED FAILURES (sorted by volume, highest impact first)")
print("=" * 65)

for _, r in failures.sort_values("volume", ascending=False).iterrows():
    print(
        f"\n  TRANCD={r['TRANCD']:>5} | vol={r['volume']:>6,}"
        f" | conf={r['llm_confidence']} | {r['failure_type']}"
    )
    print(f"    sample desc:  {str(r['sample_desc_1'])[:55]}")
    print(f"    gt desc:      {str(r['gt_desc'])[:55]}")

    for lvl in ["L1", "L2", "L3", "L4"]:
        llm_val = str(r[f"llm_{lvl}"])
        gt_val  = str(r[f"gt_{lvl}"])
        ok      = "Y" if r[f"match_{lvl}"] else "X"
        if not r[f"match_{lvl}"]:
            print(f'    {lvl}: [{ok}] LLM="{llm_val}" vs GT="{gt_val}"')
        else:
            print(f'    {lvl}: [{ok}] "{llm_val}"')

In [None]:
# ── Root-cause: uninformative descriptions ────────────────────────

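# Case-insensitive, consistent with the comparison helpers in section 4
unclassified = df_l1[df_l1["llm_L2"].astype(str).str.strip().str.lower() == "unclassified"]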

print("=" * 65)
print("ROOT CAUSE: Uninformative descriptions (LLM → 'Unclassified')")
print("=" * 65)
if len(unclassified) > 0:
    print(f"\n{len(unclassified)} codes where the sample_desc_1 was an address,")
    print("a person name, or other non-descriptive text:\n")
    for _, r in unclassified.iterrows():
        print(
            f'  TRANCD={r["TRANCD"]:>5} '
            f'| "{str(r["sample_desc_1"])[:35]}" '
            f'| GT: {r["gt_L2"]} > {r["gt_L3"]}'
        )
    vol_pct = unclassified["volume"].sum() / df_l1["volume"].sum() * 100
    print(f"\n  Volume impact: {unclassified['volume'].sum():,} txns ({vol_pct:.1f}%)")
    print("  Fix: Feed the GT description from the Master Fee Table, not raw EFHDS1.")
else:
    print("None — all codes received a classification.")

---
## 9. Summary Report

In [None]:
vol_total   = df_res["volume"].sum()
vol_l1      = df_l1["volume"].sum()
vw_exact    = (df_l1["exact_match"] * df_l1["volume"]).sum() / vol_l1
vw_partial  = (df_l1["partial_match_L1L2"] * df_l1["volume"]).sum() / vol_l1
amb_match   = df_l2["exact_match_any"].mean() if len(df_l2) > 0 else 0
n_uncl      = len(unclassified) if "unclassified" in globals() else 0
n_wrong_blk = len(failures[failures["failure_type"] == "WRONG BLOCK (L1)"])
target_met  = vw_exact >= 0.80
status      = "MET" if target_met else "BELOW TARGET"

print("=" * 65)
print("  TRANSACTION CATEGORIZATION — ACCURACY SUMMARY")
print("=" * 65)
print(f"")
print(f"  Total codes tested:              {len(df_res)}")
print(f"  Total transaction volume:         {vol_total:,}")
print(f"")
print(f"  LAYER 1 — Obvious (single-mapping)")
print(f"    Codes: {len(df_l1)}    Volume: {vol_l1:,}")
print(f"    L1 (Block):           {df_l1['match_L1'].mean():.1%}")
print(f"    L2 (Category):        {df_l1['match_L2'].mean():.1%}")
print(f"    L3 (Channel):         {df_l1['match_L3'].mean():.1%}")
print(f"    L4 (Subtype):         {df_l1['match_L4'].mean():.1%}")
print(f"    Exact Match (all 4):  {df_l1['exact_match'].mean():.1%}  (vol-wt: {vw_exact:.1%})")
print(f"    Partial (L1+L2):      {df_l1['partial_match_L1L2'].mean():.1%}  (vol-wt: {vw_partial:.1%})")
print(f"")
print(f"  LAYER 2 — Ambiguous: {len(df_l2)} codes")
print(f"    Match (any valid):    {amb_match:.1%}")
print(f"")
print(f"  LAYER 3 — Unknown: {len(df_l3)} codes (manual review)")
print(f"")
print(f"  KEY FAILURES:")
print(f"    Uninformative descriptions:  {n_uncl} codes -> 'Unclassified'")
print(f"    Wrong block (L1):            {n_wrong_blk} codes")
print(f"")
print(f"  TARGET: >=80% vol-weighted exact match -> {status}")
print("=" * 65)

---
## 10. Export

In [None]:
# Layer 1 — with all match columns
df_l1.to_csv(f"{OUTPUT_DIR}/eval_layer1_obvious.csv", index=False)

# Layer 2 — ambiguous summary
df_l2.to_csv(f"{OUTPUT_DIR}/eval_layer2_ambiguous.csv", index=False)

# Layer 3 — for manual review
df_l3["review_status"] = "NEEDS MANUAL REVIEW"
df_l3.to_csv(f"{OUTPUT_DIR}/eval_layer3_unknown.csv", index=False)

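# Combined (Layer 2 rows use the summary schema from section 6, so tag their layer explicitly)
df_all = pd.concat(
    [df_l1, df_l2.assign(test_layer="Layer 2: Ambiguous"), df_l3],
    ignore_index=True,
    sort=False,
)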
df_all.to_csv(f"{OUTPUT_DIR}/eval_full_report.csv", index=False)

print(f"Reports saved to {OUTPUT_DIR}/")
print(f"  eval_layer1_obvious.csv    ({len(df_l1)} rows)")
print(f"  eval_layer2_ambiguous.csv  ({len(df_l2)} rows)")
print(f"  eval_layer3_unknown.csv    ({len(df_l3)} rows)")
print(f"  eval_full_report.csv       ({len(df_all)} rows)")

---
## 11. Next-Iteration Prompt Improvements

Based on the failure analysis, concrete fixes for the next `ai_query` run:

### Fix 1 — Uninformative descriptions
Codes 237, 242, 261, 283, 299 have street addresses as `sample_desc_1` (e.g., `306 W BROADWAY ST`). The LLM has zero signal and defaults to `Unclassified`.  
**Fix:** Include a TRANCD → canonical-description lookup table in the prompt. Map `237 → ATM W/D`, `283 → ATM Deposit`, etc. from the Master Fee Table.
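
One way the lookup could be assembled is sketched below. The function names and the `Known transaction codes:` header are hypothetical, and the example pairs reuse only the two mappings quoted above; a real run would presumably pass something like `zip(df_gt["TRANCD"], df_gt["gt_desc"])`.

```python
def build_description_lookup(pairs):
    """Return {TRANCD: description}, keeping the first non-empty description per code."""
    lookup = {}
    for code, desc in pairs:
        if isinstance(desc, str) and desc.strip() and code not in lookup:
            lookup[code] = desc.strip()
    return lookup


def render_lookup(lookup):
    """Format the lookup as a plain-text block that can be pasted into a prompt."""
    lines = [f"{code} -> {desc}" for code, desc in sorted(lookup.items())]
    return "Known transaction codes:\n" + "\n".join(lines)


# Example pairs taken from the text above; the duplicate shows that repeats are ignored.
example_pairs = [("237", "ATM W/D"), ("283", "ATM Deposit"), ("237", "ATM W/D")]
prompt_block = render_lookup(build_description_lookup(example_pairs))
print(prompt_block)
```

Keeping only the first description per code is a simplification. For the multi-mapping codes it may be better to list every description so the prompt does not hide the ambiguity.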

### Fix 2 — ACH returns mis-classified as NSF/OD
Codes 8 and 59 (`ORIGINATED ACH ITEM RETURNED`, `No Account/Unable to Locate`) were classified as NSF/OD, but the GT says `Money movement > ACH`.  
**Fix:** Add few-shot: `"ACH returns and rejects are Money movement > ACH, not NSF/OD."`

### Fix 3 — Account operations vs Money movement
Codes 918/919 (`Trnsfr Frm Act Ending in...`) look like transfers, but the GT says `Account operations > Closing`. The same applies to 741/744 (`Investment Sweep`).  
**Fix:** Add rule: `"Account closings, investment sweeps, and IRA distributions are Account operations, not Money movement."`

### Fix 4 — L3 casing drift
The LLM outputs `Debit Card` while the taxonomy uses `Debit card`.  
**Fix:** Add explicit instruction: `"Use these EXACT L3 strings: ACH, ATM, Check, Debit card, Wire, Transfers & Payments, Deposits, Withdrawals, Interest, Closing, Misc."`
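
As a complement to the prompt instruction, LLM output could be snapped back to the canonical spelling after the fact. The sketch below assumes the vocabulary is exactly the list quoted above and that the trailing period after `Misc` is sentence punctuation, not part of the label; `L3_VOCAB` and `snap_l3` are hypothetical names.

```python
# Assumption: vocabulary copied from the instruction above, with "Misc" taken
# to have no trailing period.
L3_VOCAB = [
    "ACH", "ATM", "Check", "Debit card", "Wire", "Transfers & Payments",
    "Deposits", "Withdrawals", "Interest", "Closing", "Misc",
]
_BY_KEY = {v.lower(): v for v in L3_VOCAB}


def snap_l3(value):
    """Return the canonical L3 spelling, or None if the value is not in the vocabulary."""
    if not isinstance(value, str):
        return None
    return _BY_KEY.get(value.strip().lower())


print(snap_l3("Debit Card"))
print(snap_l3("Crypto"))
```

A `None` result flags an out-of-vocabulary label that should be reviewed; it should not be silently coerced to the nearest match.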

### Fix 5 — Code 334 (Service Charge Refund)
The prompt rule "Refunds/Reversals of fees → Block A > Money movement > Deposits" overrode the GT mapping `Fee item > All others > Account Operations`.  
**Fix:** Check with Sid/Mike whether the prompt rule or the GT is correct, then align.