# 02 — Categorize Products

**Objective:** Classify every product code in the catalog into the StrategyCorp
product taxonomy using `ai_query()` with `responseFormat` for structured JSON output.

### Pipeline

1. Read the `product_code_catalog` from Unity Catalog (produced by `01_prepare_product_data`).
2. Load the product taxonomy markdown as prompt context.
3. Batch codes (10 per prompt, grouped by `source_file`) and call `ai_query()` with a JSON schema response format.
4. Track token estimates and cost per batch.
5. Save all results to a single `product_classification_results` UC table with `layer`, `prompt_version`, and cost metadata.

**Runs on:** Databricks Runtime 15.4 LTS or above.

In [None]:
# ── Configuration ─────────────────────────────────────────────────
CATALOG_NAME    = "ciq-bp_dummy-dev"
SCHEMA_NAME     = "default"
MODEL_NAME      = "databricks-claude-opus-4-6"
PROMPT_VERSION  = "v1.0"
BATCH_SIZE      = 10
TAXONOMY_PATH   = "../../data/taxonomy/product_categorization_taxonomy.md"

CATALOG_TABLE  = f"`{CATALOG_NAME}`.`{SCHEMA_NAME}`.product_code_catalog"
RESULTS_TABLE  = f"`{CATALOG_NAME}`.`{SCHEMA_NAME}`.product_classification_results"

# Token pricing (Databricks Foundation Model Serving — approximate $/1K tokens)
PRICING = {
    "databricks-claude-opus-4-6": {"input": 0.015, "output": 0.075},
    "databricks-meta-llama-3-1-70b-instruct": {"input": 0.001, "output": 0.001},
}

print(f"Model:          {MODEL_NAME}")
print(f"Prompt version: {PROMPT_VERSION}")
print(f"Batch size:     {BATCH_SIZE}")
print(f"Catalog table:  {CATALOG_TABLE}")
print(f"Results table:  {RESULTS_TABLE}")

---
## Step 1 — Validate upstream tables

In [None]:
import pandas as pd
import json
from datetime import datetime

try:
    catalog_sdf = spark.table(CATALOG_TABLE)
    catalog_cols = [f.name for f in catalog_sdf.schema.fields]
    assert "layer" in catalog_cols, "Missing 'layer' column in catalog — run 01_prepare_product_data first"
    assert "product_code" in catalog_cols, "Missing 'product_code' column in catalog"
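    assert "product_name" in catalog_cols, "Missing 'product_name' column in catalog"
    # source_file is read below and drives batching in Step 5, so validate it here too
    assert "source_file" in catalog_cols, "Missing 'source_file' column in catalog"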

    df_catalog = catalog_sdf.toPandas()
    df_catalog["product_code"] = df_catalog["product_code"].astype(str).str.strip()

    print(f"Loaded {len(df_catalog)} codes from {CATALOG_TABLE}")
    print(f"Columns: {list(df_catalog.columns)}")
    print(f"\nLayer distribution:")
    for layer in sorted(df_catalog["layer"].unique()):
        n = len(df_catalog[df_catalog["layer"] == layer])
        print(f"  Layer {layer}: {n} codes")
    print(f"\nSource file distribution:")
    print(df_catalog["source_file"].value_counts().to_string())
except NameError:
    print("Spark session not found — run this notebook in Databricks.")
    raise SystemExit("Requires Databricks environment.")
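
The assertions above stop at the first missing column. A minimal sketch of a check that reports every missing column in one pass is shown below. `find_missing_columns` and `REQUIRED_CATALOG_COLS` are hypothetical names, and the required list is an assumption based on the columns this notebook reads directly.

```python
# Hypothetical helper: collect all missing columns instead of failing on the first.
# The required list is an assumption based on the columns this notebook reads.
REQUIRED_CATALOG_COLS = ("product_code", "product_name", "source_file", "layer")


def find_missing_columns(available, required=REQUIRED_CATALOG_COLS):
    """Return the required column names absent from `available`, in order."""
    present = set(available)
    return [col for col in required if col not in present]


example_cols = ["product_code", "product_name", "account_count"]
print(find_missing_columns(example_cols))  # ['source_file', 'layer']
```

In the notebook, `available` could be `catalog_cols`, with a single assertion on the returned list.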

---
## Step 2 — Load taxonomy and build system prompt

In [None]:
import os

if not os.path.exists(TAXONOMY_PATH):
    raise FileNotFoundError(f"Taxonomy file not found: {TAXONOMY_PATH}")

with open(TAXONOMY_PATH, "r") as f:
    taxonomy_md = f.read()

print(f"Taxonomy loaded: {len(taxonomy_md)} chars (~{len(taxonomy_md)//4} tokens)")

In [None]:
SYSTEM_PROMPT = f"""You are a product categorization engine for a US bank.
Given a list of product codes with descriptions and context, classify each into the
StrategyCorp product taxonomy below.

{taxonomy_md}

### Rules:
1. First determine the Line of Business (Level 1): Retail, Business, or Wealth Management.
   - For loans: use PURCOD (purpose code) when available. Codes 01-02 = Retail, 03-06/09 = Business.
   - For deposits: product names containing "BUSINESS", "BUS", "COMMERCIAL", "CORPORATE" = Business.
   - Products with "FIDUCIARY", "WMG", "Wealth" = Wealth Management.
   - Public Funds products are ambiguous — classify as Business if unclear.
2. Determine Product Type (Level 2): Deposits, Loans, Services, Cash Management Services, Securities.
3. Determine Category (Level 3) based on the product description and type code.
4. Propose Sub-category (Level 4) when the description provides enough context:
   - Deposits: Interest Bearing, Non-Interest Bearing, IOLTA, etc.
   - Loans: LOC, Loan, specific sub-types.
   - Leave null if insufficient information.
5. Propose Special (Level 5) when applicable (e.g., ARM, Fixed, Custodial, <$1M, >$1M).
   Leave null if insufficient information.
6. Use EXACT strings from the taxonomy for Levels 1-3. Levels 4-5 are FI-configurable.
7. Use "Unclassified" for any level if no mapping fits. Do not guess.
8. `confidence`: 0.0 to 1.0 — your certainty in the classification.

### Few-Shot Examples:

Input: product_code=01, DESC="PERSONAL CHECKING", DOMAIN=Deposit
Output: {{
  "product_code": "01",
  "line_of_business": "Retail",
  "product_type": "Deposits",
  "product_category": "Checking (DDA)",
  "product_subcategory": "Non-interest bearing",
  "product_special": null,
  "confidence": 0.98
}}

Input: product_code=B5, DESC="COMMERCIAL ANALYSIS", DOMAIN=Deposit
Output: {{
  "product_code": "B5",
  "line_of_business": "Business",
  "product_type": "Deposits",
  "product_category": "Checking (DDA)",
  "product_subcategory": "Analyzed",
  "product_special": null,
  "confidence": 0.95
}}

Input: product_code=R1, DESC="Res RE - Fixed", DOMAIN=Loan, PURCOD=02, PURPOSE="Real Estate — Residential"
Output: {{
  "product_code": "R1",
  "line_of_business": "Retail",
  "product_type": "Loans",
  "product_category": "Mortgage Loans",
  "product_subcategory": "Conforming",
  "product_special": "Fixed",
  "confidence": 0.92
}}
"""

print(f"System prompt: {len(SYSTEM_PROMPT)} chars (~{len(SYSTEM_PROMPT)//4} tokens)")
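
Because `SYSTEM_PROMPT` is an f-string, literal braces in the few-shot JSON have to be escaped by doubling them. The sketch below uses a hypothetical `payload` value to show how doubled and quadrupled braces render. Printing the final prompt before sending it is a quick way to confirm the examples contain valid JSON.

```python
import json

payload = "example"  # hypothetical placeholder value, not taken from the catalog

# In an f-string, "{{" renders as "{" and "}}" renders as "}".
single = f'{{"product_code": "{payload}"}}'
# Four braces therefore render as two, which is no longer valid JSON.
double = f'{{{{"product_code": "{payload}"}}}}'

print(single)  # {"product_code": "example"}
print(double)  # {{"product_code": "example"}}

# The single-brace rendering parses as JSON; the double-brace rendering does not.
print(json.loads(single)["product_code"])  # example
```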

---
## Step 3 — Define responseFormat JSON schema

In [None]:
RESPONSE_SCHEMA = {
    "type": "json_schema",
    "json_schema": {
        "name": "product_classifications",
        "schema": {
            "type": "object",
            "properties": {
                "classifications": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "product_code":        {"type": "string"},
                            "line_of_business":    {"type": "string"},
                            "product_type":        {"type": "string"},
                            "product_category":    {"type": "string"},
                            "product_subcategory": {"type": ["string", "null"]},
                            "product_special":     {"type": ["string", "null"]},
                            "confidence":          {"type": "number"}
                        },
                        "required": [
                            "product_code", "line_of_business", "product_type",
                            "product_category", "product_subcategory",
                            "product_special", "confidence"
                        ]
                    }
                }
            },
            "required": ["classifications"]
        },
        "strict": True
    }
}

response_schema_str = json.dumps(RESPONSE_SCHEMA)
print(f"Response schema ready ({len(response_schema_str)} chars)")
print(json.dumps(RESPONSE_SCHEMA, indent=2))
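
Structured output reduces the chance of a malformed response, but a defensive check on the parsed payload is cheap. The following is a minimal sketch in plain Python, not a full JSON Schema validator. `check_classifications` is a hypothetical helper, its key list mirrors the `required` list above, and the 0.0 to 1.0 confidence range comes from the prompt rules.

```python
import json

# Mirrors the "required" list in RESPONSE_SCHEMA (copied here so the sketch is self-contained).
REQUIRED_KEYS = (
    "product_code", "line_of_business", "product_type", "product_category",
    "product_subcategory", "product_special", "confidence",
)


def check_classifications(raw_json, required_keys=REQUIRED_KEYS):
    """Parse a response string and return a list of problems (empty if none)."""
    try:
        payload = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    items = payload.get("classifications") if isinstance(payload, dict) else None
    if not isinstance(items, list):
        return ["missing 'classifications' array"]

    problems = []
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {i}: not an object")
            continue
        absent = [k for k in required_keys if k not in item]
        if absent:
            problems.append(f"item {i}: missing {absent}")
        conf = item.get("confidence")
        is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
        if is_number and not 0.0 <= conf <= 1.0:
            problems.append(f"item {i}: confidence {conf} outside [0, 1]")
    return problems


sample = json.dumps({"classifications": [{"product_code": "01", "confidence": 1.4}]})
for problem in check_classifications(sample):
    print(problem)
```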

---
## Step 4 — Batch classification engine

In [None]:
def build_user_prompt(batch_df):
    """Build the user prompt listing product codes for a single batch."""
    lines = []
    for _, row in batch_df.iterrows():
        desc = str(row["product_name"]).strip()
        if desc in ("nan", "", "None", "(unknown)"):
            desc = "(no description)"

        domain = str(row["source_file"]).strip()
        parts = [f'product_code={row["product_code"]}', f'DESC="{desc}"', f'DOMAIN={domain}']

        # Add loan context when available
        purcod = row.get("sample_purcod")
        if pd.notna(purcod) and str(purcod).strip() not in ("", "nan", "None"):
            parts.append(f'PURCOD={purcod}')
        purpose = row.get("sample_purpose_desc")
        if pd.notna(purpose) and str(purpose).strip() not in ("", "nan", "None"):
            parts.append(f'PURPOSE="{str(purpose).strip()}"')

        lines.append(", ".join(parts))

    codes_text = "\n".join(f"  - {line}" for line in lines)
    return (
        f"Classify these {len(batch_df)} product codes:\n{codes_text}\n\n"
        f'Return a JSON object with a "classifications" array containing one entry per code.'
    )


def estimate_tokens(text):
    """Rough token estimate: ~4 chars per token."""
    return len(text) // 4


def estimate_cost(tokens_in, tokens_out, model_name):
    """Calculate estimated cost in USD."""
    prices = PRICING.get(model_name, {"input": 0.0, "output": 0.0})
    return (tokens_in * prices["input"] + tokens_out * prices["output"]) / 1000


# Quick test
test_batch = df_catalog.head(3)
print("Sample user prompt:")
print(build_user_prompt(test_batch))
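
As a worked example of the estimate, the sketch below applies the 4-characters-per-token heuristic and the per-1K pricing to assumed text lengths. The 48,000 and 3,200 character figures are illustrative assumptions, not measurements, and the rates copy the `PRICING` entry for `databricks-claude-opus-4-6`.

```python
# Rates copied from the PRICING entry for databricks-claude-opus-4-6 ($ per 1K tokens).
INPUT_RATE_PER_1K = 0.015
OUTPUT_RATE_PER_1K = 0.075


def estimate_tokens(text):
    """Rough token estimate: ~4 characters per token."""
    return len(text) // 4


def cost_usd(tokens_in, tokens_out):
    """Estimated cost in USD for one call."""
    return (tokens_in * INPUT_RATE_PER_1K + tokens_out * OUTPUT_RATE_PER_1K) / 1000


prompt_text = "x" * 48_000    # stand-in for a 48,000-character prompt (assumed size)
response_text = "y" * 3_200   # stand-in for a 3,200-character response (assumed size)

tokens_in = estimate_tokens(prompt_text)     # 48,000 / 4 = 12,000
tokens_out = estimate_tokens(response_text)  # 3,200 / 4 = 800

# (12,000 * 0.015 + 800 * 0.075) / 1000 = (180 + 60) / 1000
print(tokens_in, tokens_out, round(cost_usd(tokens_in, tokens_out), 4))  # 12000 800 0.24
```

Since the system prompt is resent with every batch, input tokens usually dominate the per-batch cost.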

In [None]:
def classify_batch(batch_df, batch_num, source_file):
    """
    Classify a batch of product codes via ai_query().
    Returns a list of result dicts with metadata.
    """
    user_prompt = build_user_prompt(batch_df)
    full_prompt = SYSTEM_PROMPT + "\n" + user_prompt
    escaped = full_prompt.replace("'", "''")
    escaped_schema = response_schema_str.replace("'", "''")

    query = f"""
    SELECT ai_query(
        '{MODEL_NAME}',
        '{escaped}',
        responseFormat => '{escaped_schema}'
    ) as result
    """

    tokens_in = estimate_tokens(full_prompt)

    print(f"  Batch {batch_num} | {source_file} | {len(batch_df)} codes | ~{tokens_in} tokens in")

    result_raw = spark.sql(query).collect()[0]["result"]
    tokens_out = estimate_tokens(result_raw)
    cost = estimate_cost(tokens_in, tokens_out, MODEL_NAME)

    parsed = json.loads(result_raw)
    classifications = parsed.get("classifications", [])

    run_ts = datetime.utcnow().isoformat()

    batch_lookup = batch_df.set_index("product_code").to_dict(orient="index")

    results = []
    for cls in classifications:
        pc = str(cls["product_code"]).strip()
        meta = batch_lookup.get(pc, {})
        results.append({
            "product_code":        pc,
            "product_name":        meta.get("product_name", ""),
            "source_file":         meta.get("source_file", source_file),
            "account_count":       meta.get("account_count", 0),
            "layer":               meta.get("layer", 3),
            "line_of_business":    cls.get("line_of_business"),
            "product_type":        cls.get("product_type"),
            "product_category":    cls.get("product_category"),
            "product_subcategory": cls.get("product_subcategory"),
            "product_special":     cls.get("product_special"),
            "confidence":          cls.get("confidence"),
            "prompt_version":      PROMPT_VERSION,
            "model_name":          MODEL_NAME,
            "run_timestamp":       run_ts,
            "tokens_in":           tokens_in,
            "tokens_out":          tokens_out,
            "estimated_cost":      cost,
            "llm_raw":             result_raw,
        })

    print(f"    -> Parsed {len(results)} classifications | ~{tokens_out} tokens out | ${cost:.4f}")
    return results

print("classify_batch() defined.")
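
`classify_batch()` returns whatever entries the model sends back, so a code the model omits never reaches the results. A minimal sketch of a reconciliation step is shown below. `reconcile_codes` is a hypothetical helper, and it normalizes codes the same way the notebook does, with `str(...).strip()`.

```python
def reconcile_codes(requested, returned):
    """Compare requested and returned product codes.

    Returns (missing, unexpected): codes that were requested but not returned,
    and codes that were returned but never requested.
    """
    requested_norm = [str(c).strip() for c in requested]
    returned_norm = [str(c).strip() for c in returned]
    requested_set = set(requested_norm)
    returned_set = set(returned_norm)
    missing = [c for c in requested_norm if c not in returned_set]
    unexpected = [c for c in returned_norm if c not in requested_set]
    return missing, unexpected


missing, unexpected = reconcile_codes(["01", "B5", "R1"], ["01", " R1 ", "ZZ"])
print(missing, unexpected)  # ['B5'] ['ZZ']
```

Missing codes could then be retried in a later batch or written out with an explicit marker, depending on how strict the run needs to be.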

---
## Step 5 — Run classification by source file

In [None]:
import math

all_results = []
layer_names = {1: "Obvious", 2: "Ambiguous", 3: "Unknown"}

for source_file in ["Deposit", "Loan", "CD"]:
    source_df = df_catalog[df_catalog["source_file"] == source_file].reset_index(drop=True)
    if len(source_df) == 0:
        print(f"\nSkipping {source_file}: no codes")
        continue

    n_batches = math.ceil(len(source_df) / BATCH_SIZE)

    print(f"\n{'='*60}")
    print(f"{source_file} | {len(source_df)} codes | {n_batches} batches")
    print(f"{'='*60}")

    for i in range(n_batches):
        batch = source_df.iloc[i * BATCH_SIZE : (i + 1) * BATCH_SIZE]
        try:
            batch_results = classify_batch(batch, i + 1, source_file)
            all_results.extend(batch_results)
        except Exception as e:
            print(f"    ERROR in batch {i+1}: {e}")
            for _, row in batch.iterrows():
                all_results.append({
                    "product_code":        str(row["product_code"]),
                    "product_name":        row.get("product_name", ""),
                    "source_file":         source_file,
                    "account_count":       row.get("account_count", 0),
                    "layer":               row.get("layer", 3),
                    "line_of_business":    "ERROR",
                    "product_type":        str(e)[:200],
                    "product_category":    None,
                    "product_subcategory": None,
                    "product_special":     None,
                    "confidence":          0.0,
                    "prompt_version":      PROMPT_VERSION,
                    "model_name":          MODEL_NAME,
                    "run_timestamp":       datetime.utcnow().isoformat(),
                    "tokens_in":           0,
                    "tokens_out":          0,
                    "estimated_cost":      0.0,
                    "llm_raw":             str(e),
                })

print(f"\n{'='*60}")
print(f"TOTAL: {len(all_results)} classifications collected")
print(f"{'='*60}")
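
The batch arithmetic in the loop above can be checked in isolation. The sketch below applies the same `math.ceil` and slice logic to a plain list instead of a DataFrame. `make_batches` is a hypothetical helper and the 23 codes are made up.

```python
import math


def make_batches(items, batch_size):
    """Split `items` into consecutive chunks of at most `batch_size` elements."""
    n_batches = math.ceil(len(items) / batch_size)
    return [items[i * batch_size:(i + 1) * batch_size] for i in range(n_batches)]


codes = [f"C{n:02d}" for n in range(23)]  # 23 made-up product codes
batches = make_batches(codes, 10)
print([len(b) for b in batches])  # [10, 10, 3]
```

The last batch holds the remainder, so 23 codes with a batch size of 10 produce three `ai_query()` calls.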

---
## Step 6 — Results summary

In [None]:
df_results = pd.DataFrame(all_results)

print(f"Total results: {len(df_results)}")
print(f"Columns: {list(df_results.columns)}\n")

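# Token and cost figures are recorded per batch and repeated on every row of that
# batch, so keep one row per batch before summing to avoid overcounting.
df_batches = df_results.drop_duplicates(subset=["source_file", "run_timestamp"])

for source in ["Deposit", "Loan", "CD"]:
    src_df = df_results[df_results["source_file"] == source]
    src_batches = df_batches[df_batches["source_file"] == source]
    errors = len(src_df[src_df["line_of_business"] == "ERROR"])

    print(f"{source}:")
    print(f"  Codes:      {len(src_df)}")
    print(f"  Errors:     {errors}")
    print(f"  Tokens in:  {src_batches['tokens_in'].sum():,}")
    print(f"  Tokens out: {src_batches['tokens_out'].sum():,}")
    print(f"  Est. cost:  ${src_batches['estimated_cost'].sum():.4f}")
    print()

total_cost = df_batches["estimated_cost"].sum()
print(f"Total estimated cost: ${total_cost:.4f}")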

In [None]:
valid = df_results[df_results["line_of_business"] != "ERROR"]

print("Line of Business (L1) distribution:")
print(valid["line_of_business"].value_counts().to_string())

print("\nProduct Type (L2) distribution:")
print(valid["product_type"].value_counts().to_string())

print("\nProduct Category (L3) distribution:")
print(valid["product_category"].value_counts().to_string())

print("\nConfidence statistics:")
print(valid["confidence"].describe().to_string())

low_conf = valid[valid["confidence"] < 0.8]
if len(low_conf) > 0:
    print(f"\nLow-confidence ({len(low_conf)} codes < 0.8):")
    for _, row in low_conf.iterrows():
        print(
            f"  code={row['product_code']:>3} | conf={row['confidence']:.2f}"
            f" | {row['line_of_business']} > {row['product_type']} > {row['product_category']}"
        )
else:
    print("\nNo low-confidence classifications (all >= 0.8).")

---
## Step 7 — Save results to Unity Catalog

In [None]:
df_to_save = df_results.copy()
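# Null or non-numeric account counts would make astype(int) fail, so coerce them to 0 first
df_to_save["account_count"] = pd.to_numeric(df_to_save["account_count"], errors="coerce").fillna(0).astype(int)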
df_to_save["tokens_in"] = df_to_save["tokens_in"].astype(int)
df_to_save["tokens_out"] = df_to_save["tokens_out"].astype(int)
df_to_save["confidence"] = df_to_save["confidence"].astype(float)
df_to_save["estimated_cost"] = df_to_save["estimated_cost"].astype(float)

try:
    sdf_results = spark.createDataFrame(df_to_save)
    sdf_results.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable(RESULTS_TABLE)
    print(f"Saved {len(df_to_save)} rows to {RESULTS_TABLE}")
except NameError:
    print("Spark session not found — skipping UC write.")
    print(f"DataFrame ready with {len(df_to_save)} rows.")

---
## Step 8 — Validation

In [None]:
try:
    count = spark.sql(f"SELECT COUNT(*) as cnt FROM {RESULTS_TABLE}").collect()[0]["cnt"]
    print(f"  OK  {RESULTS_TABLE}: {count} rows")

    result_cols = [f.name for f in spark.table(RESULTS_TABLE).schema.fields]
    required = [
        "product_code", "layer", "line_of_business", "product_type",
        "product_category", "prompt_version", "model_name",
        "tokens_in", "tokens_out", "estimated_cost",
    ]
    missing = [c for c in required if c not in result_cols]
    assert not missing, f"Missing columns: {missing}"
    print(f"  OK  All required columns present")

    err_count = spark.sql(
        f"SELECT COUNT(*) as cnt FROM {RESULTS_TABLE} WHERE line_of_business = 'ERROR'"
    ).collect()[0]["cnt"]
    if err_count > 0:
        print(f"  WARN  {err_count} codes had classification errors")
    else:
        print(f"  OK  No classification errors")

    source_counts = spark.sql(
        f"SELECT source_file, COUNT(*) as cnt FROM {RESULTS_TABLE} GROUP BY source_file ORDER BY source_file"
    ).toPandas()
    print(f"\n  Source file coverage:")
    for _, row in source_counts.iterrows():
        print(f"    {row['source_file']}: {row['cnt']} codes")

    print("\nAll validations passed." if err_count == 0 else "\nValidation finished with warnings.")
except NameError:
    print("Spark session not found — skipping validation (run in Databricks).")
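
Step 8 prints per-source row counts but does not compare them with the catalog. A minimal sketch of that comparison follows. `coverage_gaps` is a hypothetical helper and the sample values are made up. In the notebook, the inputs could be the `source_file` columns of `df_catalog` and of the saved results.

```python
from collections import Counter


def coverage_gaps(expected_sources, actual_sources):
    """Return {source_file: (expected_count, actual_count)} where the counts differ."""
    expected = Counter(expected_sources)
    actual = Counter(actual_sources)
    return {
        src: (expected[src], actual[src])
        for src in sorted(set(expected) | set(actual))
        if expected[src] != actual[src]
    }


catalog_sources = ["Deposit", "Deposit", "Loan", "CD"]   # made-up catalog rows
result_sources = ["Deposit", "Deposit", "Loan"]          # made-up result rows
print(coverage_gaps(catalog_sources, result_sources))  # {'CD': (1, 0)}
```

An empty dictionary means every source file has as many result rows as catalog rows.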