# 03 — Categorize Transactions

**Objective:** Classify every transaction code in the catalog into the StrategyCorp
taxonomy using `ai_query()` with `responseFormat` for structured JSON output.

## Pipeline

1. Read the `transaction_code_catalog` from Unity Catalog (produced by `01_prepare_data`).
2. Load the taxonomy markdown as prompt context.
3. Batch codes (15–20 per prompt) and call `ai_query()` with a JSON schema response format.
4. Track token estimates and cost per batch.
5. Save all results to a single `classification_results` UC table with `layer`, `prompt_version`, and cost metadata.

**Runs on:** Databricks Runtime 15.4 LTS or above.

In [0]:
# ── Configuration ─────────────────────────────────────────────────
CATALOG_NAME    = "ciq-bp_dummy-dev"
SCHEMA_NAME     = "default"
MODEL_NAME      = "databricks-claude-opus-4-6"
PROMPT_VERSION  = "v1.0"
BATCH_SIZE      = 15
TAXONOMY_PATH   = "../data/taxonomy/transaction_categorization_taxonomy.md"

CATALOG_TABLE  = f"`{CATALOG_NAME}`.`{SCHEMA_NAME}`.transaction_code_catalog"
RESULTS_TABLE  = f"`{CATALOG_NAME}`.`{SCHEMA_NAME}`.classification_results"

# Token pricing (Databricks Foundation Model Serving — approximate $/1K tokens)
PRICING = {
    "databricks-claude-opus-4-6": {"input": 0.015, "output": 0.075},
    "databricks-meta-llama-3-1-70b-instruct": {"input": 0.001, "output": 0.001},
}

print(f"Model:          {MODEL_NAME}")
print(f"Prompt version: {PROMPT_VERSION}")
print(f"Batch size:     {BATCH_SIZE}")
print(f"Catalog table:  {CATALOG_TABLE}")
print(f"Results table:  {RESULTS_TABLE}")

---
## Step 1 — Validate upstream tables

In [0]:
import pandas as pd
import json
from datetime import datetime

try:
    catalog_sdf = spark.table(CATALOG_TABLE)
    catalog_cols = [f.name for f in catalog_sdf.schema.fields]
    assert "layer" in catalog_cols, "Missing 'layer' column in catalog — run 01_prepare_data first"
    assert "transaction_code" in catalog_cols, "Missing 'transaction_code' column in catalog"
    assert "description_1" in catalog_cols, "Missing 'description_1' column in catalog"

    df_catalog = catalog_sdf.toPandas()
    df_catalog["transaction_code"] = df_catalog["transaction_code"].astype(str)

    print(f"Loaded {len(df_catalog)} codes from {CATALOG_TABLE}")
    print(f"Columns: {list(df_catalog.columns)}")
    print(f"\nLayer distribution:")
    for layer in sorted(df_catalog["layer"].unique()):
        n = len(df_catalog[df_catalog["layer"] == layer])
        print(f"  Layer {layer}: {n} codes")
except NameError:
    # `spark` is only defined inside a Databricks (or other Spark) session.
    print("Spark session not found. This notebook has no local fallback.")
    print("Run it in Databricks after 01_prepare_data has produced the catalog table.")
    raise SystemExit("Requires Databricks environment.")

---
## Step 2 — Load taxonomy and build system prompt

In [0]:
import os

if not os.path.exists(TAXONOMY_PATH):
    raise FileNotFoundError(f"Taxonomy file not found: {TAXONOMY_PATH}")

with open(TAXONOMY_PATH, "r", encoding="utf-8") as f:
    taxonomy_md = f.read()

print(f"Taxonomy loaded: {len(taxonomy_md)} chars (~{len(taxonomy_md)//4} tokens)")

In [0]:
SYSTEM_PROMPT = f"""You are a transaction categorization engine for a US bank.
Given a list of transaction codes (transaction_code) with descriptions, classify each into the
StrategyCorp taxonomy below.

{taxonomy_md}

### Rules:
1. First determine Block A (Non-fee item) or Block B (Fee item). Fee items typically
   contain: "fee", "charge", "surcharge", "penalty", "service charge".
2. Exception: refunds/reversals of fees must be Block A > Money movement > Deposits.
3. Classify through Level 2 > Level 3 > Level 4. Use EXACT strings from the taxonomy.
4. Use "Unclassified" if no mapping fits. Do not guess.
5. `include_in_scoring` (Block A only): true for NSF/OD and Money movement; false for
   Account operations, Misc, Unclassified. For Block B, set to false.
6. `credit_debit`: "Credit" for money into the account, "Debit" for money out.
   If unclear, use "Debit".
7. `confidence`: 0.0 to 1.0 — your certainty in the classification.

### Few-Shot Examples:

Input: transaction_code=183, DESC="ACH Debit - SERMONS"
Output: {{
  "transaction_code": "183",
  "category_1": "Non-fee item",
  "category_2": "Money movement",
  "category_3": "ACH",
  "category_4": null,
  "include_in_scoring": true,
  "credit_debit": "Debit",
  "confidence": 0.99
}}

Input: transaction_code=299, DESC="ATM Service Charge"
Output: {{
  "transaction_code": "299",
  "category_1": "Fee item",
  "category_2": "All others",
  "category_3": "Money movement",
  "category_4": "ATM",
  "include_in_scoring": false,
  "credit_debit": "Debit",
  "confidence": 0.98
}}

Input: transaction_code=141, DESC="Transfer from DDA"
Output: {{
  "transaction_code": "141",
  "category_1": "Non-fee item",
  "category_2": "Money movement",
  "category_3": "Transfers & Payments",
  "category_4": null,
  "include_in_scoring": true,
  "credit_debit": "Credit",
  "confidence": 0.95
}}
"""

print(f"System prompt: {len(SYSTEM_PROMPT)} chars (~{len(SYSTEM_PROMPT)//4} tokens)")
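
Because `SYSTEM_PROMPT` is an f-string, literal JSON braces in the few-shot examples have to be doubled, while `{taxonomy_md}` is interpolated as usual. A minimal sketch of that escaping rule, using a hypothetical `taxonomy_stub` in place of the real taxonomy file:

```python
import json

# Hypothetical stand-in for the taxonomy markdown loaded in Step 2.
taxonomy_stub = "Block A: Non-fee item\nBlock B: Fee item"

# In an f-string, "{{" renders as "{" and "}}" as "}", while
# "{taxonomy_stub}" is replaced by the variable's value.
prompt = f"""Taxonomy:
{taxonomy_stub}

Output: {{"transaction_code": "183", "confidence": 0.99}}
"""

# The rendered example line is valid JSON with single braces.
example_line = prompt.strip().splitlines()[-1]
example_json = example_line.split("Output: ", 1)[1]
parsed = json.loads(example_json)
print(parsed["transaction_code"])
```

Printing the rendered prompt once before a run is a cheap way to confirm that the examples look like the JSON the model is expected to return.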

---
## Step 3 — Define responseFormat JSON schema

In [0]:
RESPONSE_SCHEMA = {
    "type": "json_schema",
    "json_schema": {
        "name": "transaction_classifications",
        "schema": {
            "type": "object",
            "properties": {
                "classifications": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "transaction_code":    {"type": "string"},
                            "category_1":          {"type": "string"},
                            "category_2":          {"type": "string"},
                            "category_3":          {"type": ["string", "null"]},
                            "category_4":          {"type": ["string", "null"]},
                            "include_in_scoring":  {"type": "boolean"},
                            "credit_debit":        {"type": "string"},
                            "confidence":          {"type": "number"}
                        },
                        "required": [
                            "transaction_code", "category_1", "category_2",
                            "category_3", "category_4",
                            "include_in_scoring", "credit_debit", "confidence"
                        ]
                    }
                }
            },
            "required": ["classifications"]
        },
        "strict": True
    }
}

response_schema_str = json.dumps(RESPONSE_SCHEMA)
print(f"Response schema ready ({len(response_schema_str)} chars)")
print(json.dumps(RESPONSE_SCHEMA, indent=2))
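
A client-side check of the parsed response can catch entries that do not match the schema before they reach the results table. The sketch below uses only the standard library; `validate_classification` and `FIELD_TYPES` are hypothetical names, not part of the notebook, and the check covers just the required keys and basic types from the schema above.

```python
import json

# Expected Python types per required field, mirroring the schema above.
# None is allowed only where the schema permits "null".
FIELD_TYPES = {
    "transaction_code": (str,),
    "category_1": (str,),
    "category_2": (str,),
    "category_3": (str, type(None)),
    "category_4": (str, type(None)),
    "include_in_scoring": (bool,),
    "credit_debit": (str,),
    "confidence": (int, float),
}


def validate_classification(entry):
    """Return a list of problems found in one classification dict."""
    problems = []
    for field, allowed in FIELD_TYPES.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
            continue
        value = entry[field]
        # bool is a subclass of int, so reject it explicitly for numbers.
        if isinstance(value, bool) and bool not in allowed:
            problems.append(f"wrong type for {field}: bool")
        elif not isinstance(value, allowed):
            problems.append(f"wrong type for {field}: {type(value).__name__}")
    return problems


sample = json.loads(
    '{"classifications": [{"transaction_code": "183", "category_1": "Non-fee item",'
    ' "category_2": "Money movement", "category_3": "ACH", "category_4": null,'
    ' "include_in_scoring": true, "credit_debit": "Debit", "confidence": 0.99}]}'
)
for entry in sample["classifications"]:
    print(entry["transaction_code"], validate_classification(entry))
```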

---
## Step 4 — Batch classification engine

In [0]:
def build_user_prompt(batch_df):
    """Build the user prompt listing transaction codes for a single batch."""
    lines = []
    for _, row in batch_df.iterrows():
        desc = str(row["description_1"]).strip()
        if desc in ("nan", "", "None"):
            desc = "(no description)"
        lines.append(f'transaction_code={row["transaction_code"]}, DESC="{desc}"')

    codes_text = "\n".join(f"  - {line}" for line in lines)
    return (
        f"Classify these {len(batch_df)} transaction codes:\n{codes_text}\n\n"
        'Return a JSON object with a "classifications" array containing one entry per code.'
    )


def estimate_tokens(text):
    """Rough token estimate: ~4 chars per token."""
    return len(text) // 4


def estimate_cost(tokens_in, tokens_out, model_name):
    """Calculate estimated cost in USD."""
    prices = PRICING.get(model_name, {"input": 0.0, "output": 0.0})
    return (tokens_in * prices["input"] + tokens_out * prices["output"]) / 1000


# Quick test
test_batch = df_catalog.head(3)
print("Sample user prompt:")
print(build_user_prompt(test_batch))
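
As a worked example of the cost formula, assume a batch with roughly 6,000 input tokens and 800 output tokens (hypothetical figures) priced at the Opus rates from the configuration cell. The sketch restates `estimate_cost` so that it runs on its own:

```python
# Prices copied from the PRICING dict in the configuration cell ($ per 1K tokens).
PRICING = {
    "databricks-claude-opus-4-6": {"input": 0.015, "output": 0.075},
}


def estimate_cost(tokens_in, tokens_out, model_name):
    """Estimated cost in USD; models without a price entry cost 0.0."""
    prices = PRICING.get(model_name, {"input": 0.0, "output": 0.0})
    return (tokens_in * prices["input"] + tokens_out * prices["output"]) / 1000


# (6000 * 0.015 + 800 * 0.075) / 1000 = (90 + 60) / 1000 = 0.15 USD
cost = estimate_cost(6000, 800, "databricks-claude-opus-4-6")
print(f"${cost:.4f}")

# A model name missing from PRICING silently yields zero cost.
unlisted_cost = estimate_cost(6000, 800, "some-unlisted-model")
print(unlisted_cost)
```

Because unlisted models fall back to zero, a total of $0.0000 in the summary may indicate a missing price entry rather than a free run.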

In [0]:
def classify_batch(batch_df, batch_num, layer):
    """
    Classify a batch of transaction codes via ai_query().
    Returns a list of result dicts with metadata.
    """
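    user_prompt = build_user_prompt(batch_df)
    full_prompt = SYSTEM_PROMPT + "\n" + user_prompt

    # Pass the prompt as a column value instead of inlining it into the SQL text,
    # so quotes or backslashes in descriptions cannot break or alter the statement.
    # The schema JSON produced by json.dumps() contains no single quotes or
    # backslashes here, so it can be inlined as a SQL string literal.
    prompt_sdf = spark.createDataFrame([(full_prompt,)], ["prompt"])
    result_sdf = prompt_sdf.selectExpr(
        f"ai_query('{MODEL_NAME}', prompt, responseFormat => '{response_schema_str}') AS result"
    )

    tokens_in = estimate_tokens(full_prompt)

    print(f"  Batch {batch_num} | Layer {layer} | {len(batch_df)} codes | ~{tokens_in} tokens in")

    result_raw = result_sdf.collect()[0]["result"]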
    tokens_out = estimate_tokens(result_raw)
    cost = estimate_cost(tokens_in, tokens_out, MODEL_NAME)

    parsed = json.loads(result_raw)
    classifications = parsed.get("classifications", [])

    run_ts = datetime.utcnow().isoformat()

    # Build a lookup from the batch for metadata
    batch_lookup = batch_df.set_index("transaction_code").to_dict(orient="index")

    results = []
    for cls in classifications:
        tc = str(cls["transaction_code"])
        meta = batch_lookup.get(tc, {})
        results.append({
            "transaction_code":    tc,
            "description_1":      meta.get("description_1", ""),
            "volume":             meta.get("volume", 0),
            "source_file":        meta.get("source_file", ""),
            "layer":              layer,
            "category_1":         cls.get("category_1"),
            "category_2":         cls.get("category_2"),
            "category_3":         cls.get("category_3"),
            "category_4":         cls.get("category_4"),
            "include_in_scoring": cls.get("include_in_scoring"),
            "credit_debit":       cls.get("credit_debit"),
            "confidence":         cls.get("confidence"),
            "prompt_version":     PROMPT_VERSION,
            "model_name":         MODEL_NAME,
            "run_timestamp":      run_ts,
            "tokens_in":          tokens_in,
            "tokens_out":         tokens_out,
            "estimated_cost":     cost,
            "llm_raw":            result_raw,
        })

    print(f"    → Parsed {len(results)} classifications | ~{tokens_out} tokens out | ${cost:.4f}")
    return results

print("classify_batch() defined.")
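
`classify_batch()` keeps whatever entries the model returns, so a code the model skips would drop out of the results silently, and an entry with an unexpected code would get empty metadata. A possible guard, sketched with a hypothetical helper `reconcile_codes` that is not part of the notebook:

```python
def reconcile_codes(requested, returned):
    """Compare requested codes with the codes the model returned.

    Returns (missing, unexpected, duplicated) as sorted lists of strings.
    """
    requested_set = {str(code) for code in requested}
    returned_list = [str(code) for code in returned]
    returned_set = set(returned_list)

    missing = sorted(requested_set - returned_set)
    unexpected = sorted(returned_set - requested_set)
    duplicated = sorted(
        code for code in returned_set if returned_list.count(code) > 1
    )
    return missing, unexpected, duplicated


# Hypothetical batch: the model skipped 299, repeated 183, and invented 999.
missing, unexpected, duplicated = reconcile_codes(
    requested=["183", "299", "141"],
    returned=["183", "141", "183", "999"],
)
print("missing:", missing)
print("unexpected:", unexpected)
print("duplicated:", duplicated)
```

Inside `classify_batch()`, `requested` would be `batch_df["transaction_code"]` and `returned` the `transaction_code` values from `classifications`; missing codes could then be recorded in the same way as the error rows in Step 5.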

---
## Step 5 — Run classification by layer

In [0]:
import math

all_results = []
layer_names = {1: "Obvious", 2: "Ambiguous", 3: "Unknown"}

for layer_num in sorted(df_catalog["layer"].unique()):
    layer_df = df_catalog[df_catalog["layer"] == layer_num].reset_index(drop=True)
    layer_name = layer_names.get(layer_num, f"Layer {layer_num}")
    n_batches = math.ceil(len(layer_df) / BATCH_SIZE)

    print(f"\n{'='*60}")
    print(f"Layer {layer_num} — {layer_name} | {len(layer_df)} codes | {n_batches} batches")
    print(f"{'='*60}")

    for i in range(n_batches):
        batch = layer_df.iloc[i * BATCH_SIZE : (i + 1) * BATCH_SIZE]
        try:
            batch_results = classify_batch(batch, i + 1, layer_num)
            all_results.extend(batch_results)
        except Exception as e:
            print(f"    ERROR in batch {i+1}: {e}")
            # Record failed codes so we know what was missed
            for _, row in batch.iterrows():
                all_results.append({
                    "transaction_code":    str(row["transaction_code"]),
                    "description_1":      row.get("description_1", ""),
                    "volume":             row.get("volume", 0),
                    "source_file":        row.get("source_file", ""),
                    "layer":              layer_num,
                    "category_1":         "ERROR",
                    "category_2":         str(e)[:200],
                    "category_3":         None,
                    "category_4":         None,
                    "include_in_scoring": False,
                    "credit_debit":       None,
                    "confidence":         0.0,
                    "prompt_version":     PROMPT_VERSION,
                    "model_name":         MODEL_NAME,
                    "run_timestamp":      datetime.utcnow().isoformat(),
                    "tokens_in":          0,
                    "tokens_out":         0,
                    "estimated_cost":     0.0,
                    "llm_raw":            str(e),
                })

print(f"\n{'='*60}")
print(f"TOTAL: {len(all_results)} classifications collected")
print(f"{'='*60}")
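
The loop above records a batch as failed on the first exception. If transient endpoint errors are a concern, a retry wrapper with exponential backoff is one option. The sketch below is an assumption about how that could look: `with_retries` and its parameters are hypothetical, and the `sleep` argument is injectable so that the example runs instantly.

```python
import time


def with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is reached.

    Waits base_delay * 2**(attempt - 1) seconds between attempts and
    re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Demo with a stand-in for classify_batch that fails twice, then succeeds.
calls = {"n": 0}
delays = []


def flaky_classify():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated endpoint error")
    return [{"transaction_code": "183", "category_1": "Non-fee item"}]


# delays.append records the requested waits instead of actually sleeping.
result = with_retries(flaky_classify, sleep=delays.append)
print(calls["n"], delays, result[0]["transaction_code"])
```

In the Step 5 loop this would wrap the call as `with_retries(lambda: classify_batch(batch, i + 1, layer_num))`, leaving the existing `except` branch to handle batches that still fail after the last attempt.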

---
## Step 6 — Results summary

In [0]:
df_results = pd.DataFrame(all_results)

# ── Summary by layer ──────────────────────────────────────────────
print(f"Total results: {len(df_results)}")
print(f"Columns: {list(df_results.columns)}\n")

for layer_num in sorted(df_results["layer"].unique()):
    layer_df = df_results[df_results["layer"] == layer_num]
    layer_name = layer_names.get(layer_num, f"Layer {layer_num}")
    errors = len(layer_df[layer_df["category_1"] == "ERROR"])

    print(f"Layer {layer_num} ({layer_name}):")
    print(f"  Codes:      {len(layer_df)}")
    print(f"  Errors:     {errors}")
    print(f"  Tokens in:  {layer_df['tokens_in'].sum():,}")
    print(f"  Tokens out: {layer_df['tokens_out'].sum():,}")
    print(f"  Est. cost:  ${layer_df['estimated_cost'].sum():.4f}")
    print()

total_cost = df_results["estimated_cost"].sum()
print(f"Total estimated cost: ${total_cost:.4f}")

In [0]:
# ── Category distribution ─────────────────────────────────────────
valid = df_results[df_results["category_1"] != "ERROR"]

print("Category 1 (L1) distribution:")
print(valid["category_1"].value_counts().to_string())

print("\nCategory 2 (L2) distribution:")
print(valid["category_2"].value_counts().to_string())

print("\nConfidence statistics:")
print(valid["confidence"].describe().to_string())

# Show low-confidence classifications
low_conf = valid[valid["confidence"] < 0.8]
if len(low_conf) > 0:
    print(f"\nLow-confidence ({len(low_conf)} codes < 0.8):")
    for _, row in low_conf.iterrows():
        print(f"  transaction_code={row['transaction_code']:>5} | conf={row['confidence']:.2f} | {row['category_1']} > {row['category_2']} > {row['category_3']}")
else:
    print("\nNo low-confidence classifications (all >= 0.8).")
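
Rule 5 of the system prompt determines `include_in_scoring` from the first two category levels, so the model's answer can be cross-checked deterministically. A minimal sketch, assuming the Level 2 labels are spelled exactly as in the prompt rules (`expected_include_in_scoring` and `scoring_mismatches` are hypothetical helpers):

```python
# Level 2 categories that count toward scoring for Block A, per rule 5.
SCORED_L2 = {"NSF/OD", "Money movement"}


def expected_include_in_scoring(category_1, category_2):
    """Apply rule 5: only Block A rows in NSF/OD or Money movement score."""
    return category_1 == "Non-fee item" and category_2 in SCORED_L2


def scoring_mismatches(rows):
    """Return codes whose include_in_scoring flag contradicts rule 5."""
    return [
        row["transaction_code"]
        for row in rows
        if row["include_in_scoring"]
        != expected_include_in_scoring(row["category_1"], row["category_2"])
    ]


# Two rows taken from the few-shot examples, plus one hypothetical bad row.
rows = [
    {"transaction_code": "183", "category_1": "Non-fee item",
     "category_2": "Money movement", "include_in_scoring": True},
    {"transaction_code": "299", "category_1": "Fee item",
     "category_2": "All others", "include_in_scoring": False},
    {"transaction_code": "500", "category_1": "Non-fee item",
     "category_2": "Misc", "include_in_scoring": True},
]
print(scoring_mismatches(rows))
```

With the notebook's data, `rows` could be built from `valid.to_dict(orient="records")`.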

---
## Step 7 — Save results to Unity Catalog

In [0]:
# Convert types for Spark compatibility
df_to_save = df_results.copy()
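# Assumes codes are purely numeric: astype(int) raises on anything else
# and drops leading zeros.
df_to_save["transaction_code"] = df_to_save["transaction_code"].astype(int)
# Missing volumes (NaN) cannot be cast to int directly, so default them to 0.
df_to_save["volume"] = pd.to_numeric(df_to_save["volume"], errors="coerce").fillna(0).astype(int)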
df_to_save["tokens_in"] = df_to_save["tokens_in"].astype(int)
df_to_save["tokens_out"] = df_to_save["tokens_out"].astype(int)
df_to_save["confidence"] = df_to_save["confidence"].astype(float)
df_to_save["estimated_cost"] = df_to_save["estimated_cost"].astype(float)

try:
    sdf_results = spark.createDataFrame(df_to_save)
    sdf_results.write.mode("overwrite").saveAsTable(RESULTS_TABLE)
    print(f"Saved {len(df_to_save)} rows to {RESULTS_TABLE}")
except NameError:
    print("Spark session not found — skipping UC write.")
    print(f"DataFrame ready with {len(df_to_save)} rows.")
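
The cell above casts `transaction_code` from string to integer, which only preserves codes that are purely numeric and have no leading zeros. A quick pre-check, sketched with a hypothetical helper `codes_not_int_safe` that expects the codes as strings:

```python
def codes_not_int_safe(codes):
    """Return string codes that would change or fail if cast to int and back."""
    unsafe = []
    for code in codes:
        try:
            round_trip = str(int(code))
        except ValueError:
            unsafe.append(code)  # non-numeric, e.g. "A12"
            continue
        if round_trip != code:
            unsafe.append(code)  # e.g. leading zeros are lost
    return unsafe


# Hypothetical sample: "0299" loses its zero, "A12" cannot be cast at all.
unsafe_codes = codes_not_int_safe(["183", "0299", "A12", "141"])
print(unsafe_codes)
```

If the list is non-empty, keeping the column as a string in the results table is the safer choice.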

---
## Step 8 — Validation

In [0]:
try:
    count = spark.sql(f"SELECT COUNT(*) as cnt FROM {RESULTS_TABLE}").collect()[0]["cnt"]
    print(f"  OK  {RESULTS_TABLE}: {count} rows")

    result_cols = [f.name for f in spark.table(RESULTS_TABLE).schema.fields]
    required = ["transaction_code", "layer", "category_1", "category_2", "prompt_version",
                "model_name", "tokens_in", "tokens_out", "estimated_cost"]
    missing = [c for c in required if c not in result_cols]
    assert not missing, f"Missing columns: {missing}"
    print(f"  OK  All required columns present")

    # Check for errors
    err_count = spark.sql(
        f"SELECT COUNT(*) as cnt FROM {RESULTS_TABLE} WHERE category_1 = 'ERROR'"
    ).collect()[0]["cnt"]
    if err_count > 0:
        print(f"  WARN  {err_count} codes had classification errors")
    else:
        print(f"  OK  No classification errors")

    # Check layer coverage
    layer_counts = spark.sql(
        f"SELECT layer, COUNT(*) as cnt FROM {RESULTS_TABLE} GROUP BY layer ORDER BY layer"
    ).toPandas()
    print(f"\n  Layer coverage:")
    for _, row in layer_counts.iterrows():
        name = layer_names.get(row["layer"], "?")
        print(f"    Layer {row['layer']} ({name}): {row['cnt']} codes")

    if err_count > 0:
        print("\nValidation finished with warnings.")
    else:
        print("\nAll validations passed.")
except NameError:
    print("Spark session not found — skipping validation (run in Databricks).")

---
## (Optional) ai_classify & ai_extract comparison tests

These cells test Databricks task-specific AI functions as alternatives to `ai_query()`.
They are kept for reference and comparison and are not part of the main pipeline.

In [0]:
# Test ai_classify: Level 1 (Block A vs Block B)
classify_l1_query = f"""
SELECT
  transaction_code,
  description_1 as description,
  ai_classify(description_1, ARRAY('Non-fee item', 'Fee item')) as l1_prediction
FROM {CATALOG_TABLE}
"""

try:
    print("ai_classify — Level 1 (Block assignment):")
    display(spark.sql(classify_l1_query))
except Exception as e:
    print(f"Error: {e}")

In [0]:
# Test ai_classify: Level 2 categories
l2_labels = ["Money movement", "Account operations", "NSF/OD", "Misc", "Service Charges", "Interchange"]
labels_sql = ", ".join(f"'{label}'" for label in l2_labels)

classify_l2_query = f"""
SELECT
  transaction_code,
  description_1 as description,
  ai_classify(description_1, ARRAY({labels_sql})) as l2_prediction
FROM {CATALOG_TABLE}
"""

try:
    print("ai_classify — Level 2:")
    display(spark.sql(classify_l2_query))
except Exception as e:
    print(f"Error: {e}")

In [0]:
# Test ai_extract: attribute detection
extract_query = f"""
SELECT
  transaction_code,
  description_1 as description,
  ai_extract(description_1, ARRAY('transaction_method', 'is_reversal', 'is_fee', 'is_refund')) as extracted_info
FROM {CATALOG_TABLE}
"""

try:
    print("ai_extract — Attribute detection:")
    display(spark.sql(extract_query))
except Exception as e:
    print(f"Error: {e}")
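
To compare these functions with the `ai_query()` results, one simple measure is the share of codes on which two label sets agree. The sketch below assumes both sets of predictions have been collected into plain dicts keyed by transaction code; `agreement_rate` is a hypothetical helper and the sample values are made up.

```python
def agreement_rate(predictions_a, predictions_b):
    """Share of codes present in both dicts that received the same label.

    Returns (rate, disagreements); rate is None when no codes overlap.
    """
    shared = sorted(set(predictions_a) & set(predictions_b))
    if not shared:
        return None, []
    disagreements = [
        code for code in shared if predictions_a[code] != predictions_b[code]
    ]
    rate = (len(shared) - len(disagreements)) / len(shared)
    return rate, disagreements


# Made-up Level 1 labels for illustration only.
from_ai_query = {"183": "Non-fee item", "299": "Fee item", "141": "Non-fee item"}
from_ai_classify = {"183": "Non-fee item", "299": "Non-fee item", "141": "Non-fee item"}

rate, disagreements = agreement_rate(from_ai_query, from_ai_classify)
print(f"{rate:.2f}", disagreements)
```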