# 02 — Map Client Schema

**Objective:** Prototype an AI-based column mapping tool that can take a new client's
transaction file — with arbitrary column names — and automatically map each column
to the canonical schema used by the categorization pipeline.

### Why this matters

Each financial institution's (FI's) core banking system exports transaction data with different column names.
Bank Plus (Jack Henry SilverLake) uses `TRANCD`, `EFHDS1`, `EFHDS2`, etc. Another
FI on FIS or Fiserv might call them `TxnCode`, `Description`, `Memo`. Before the
categorization pipeline can run, we need a mapping from the client's columns to our
canonical schema.

### Approach

1. Take Bank Plus NON_POS data and **rename columns** to simulate an unknown client.
2. Use `ai_query()` to ask the LLM to propose a column mapping.
3. Validate against the known correct mapping.
4. Save the mapping to Unity Catalog for downstream use.

**Runs on:** Databricks Runtime 15.4 LTS or above.

In [None]:
# ── Configuration ─────────────────────────────────────────────────
CATALOG_NAME = "ciq-bp_dummy-dev"
SCHEMA_NAME  = "default"
MODEL_NAME   = "databricks-claude-opus-4-6"

RAW_NON_POS_PATH = "../data/bank-plus-data/raw/CheckingIQ_NON_POS_Daily_012626_rerun.csv"

MAPPINGS_TABLE = f"`{CATALOG_NAME}`.`{SCHEMA_NAME}`.client_column_mappings"

# Canonical schema — the column names our pipeline expects
CANONICAL_SCHEMA = {
    "ACCTNO":       "Masked account number",
    "status":       "Account status code",
    "TRANCD":       "Transaction code (numeric, primary key for categorization)",
    "description":  "Transaction description (may be empty)",
    "TRDATE":       "Transaction date",
    "AMT":          "Transaction amount",
    "EFHDS1":       "Extended description line 1 (primary text for categorization)",
    "EFHDS2":       "Extended description line 2 (secondary text)",
    "Account#":     "Internal account number",
    "PostingDate":  "Posting date",
}

print(f"Model: {MODEL_NAME}")
print(f"Output table: {MAPPINGS_TABLE}")

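The backticks in `MAPPINGS_TABLE` matter because the catalog name contains hyphens. A minimal sketch of a helper that builds such a name is shown here; `quote_identifier` and `qualified_table_name` are hypothetical names, not part of this notebook, and the sketch assumes the Spark SQL rule that a backtick inside a delimited identifier is escaped by doubling it.

```python
def quote_identifier(name: str) -> str:
    """Wrap one identifier part in backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def qualified_table_name(catalog: str, schema: str, table: str) -> str:
    """Join catalog, schema, and table into a three-level, fully quoted name."""
    return ".".join(quote_identifier(part) for part in (catalog, schema, table))


# Same catalog and schema values as the configuration cell above.
print(qualified_table_name("ciq-bp_dummy-dev", "default", "client_column_mappings"))
```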
---
## Step 1 — Load sample data and simulate a new client

We take the first 20 rows of Bank Plus NON_POS data and rename columns to simulate
what a different FI's export might look like.

In [None]:
import pandas as pd
import json

df_original = pd.read_csv(RAW_NON_POS_PATH, nrows=20, dtype=str)
df_original.columns = [c.strip() for c in df_original.columns]

print(f"Original columns: {list(df_original.columns)}")
print(f"Sample rows: {len(df_original)}")
df_original.head(5)

In [None]:
# Simulate a new client by renaming columns to plausible alternatives
SIMULATED_RENAME = {
    "ACCTNO":      "Account_Num",
    "status":      "Acct_Status",
    "TRANCD":      "Trans_Code",
    "description": "Trans_Desc",
    "TRDATE":      "Transaction_Date",
    "AMT":         "Amount",
    "EFHDS1":      "Description_1",
    "EFHDS2":      "Memo_Line",
    "Account#":    "Internal_Acct",
    "PostingDate": "Post_Date",
}

# The reverse mapping is the ground truth for validation
CORRECT_MAPPING = {v: k for k, v in SIMULATED_RENAME.items()}

df_simulated = df_original.rename(columns=SIMULATED_RENAME)

print(f"Simulated client columns: {list(df_simulated.columns)}")
print(f"\nCorrect mapping (ground truth for validation):")
for sim, canon in CORRECT_MAPPING.items():
    print(f"  {sim:<20} → {canon}")

df_simulated.head(5)

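The ground truth is obtained by inverting `SIMULATED_RENAME`, which is only safe when no two canonical columns share a simulated name; a plain dict comprehension would silently keep the last one. A small sketch of a stricter inversion, where `invert_mapping` is a hypothetical helper and the example dict is an abbreviated copy of the renames above:

```python
from collections import Counter


def invert_mapping(rename: dict[str, str]) -> dict[str, str]:
    """Invert a rename dict, raising if two source columns share a target name."""
    counts = Counter(rename.values())
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"Rename targets are not unique: {duplicates}")
    return {new: old for old, new in rename.items()}


# Abbreviated example using three of the renames from the cell above.
example_rename = {"TRANCD": "Trans_Code", "EFHDS1": "Description_1", "EFHDS2": "Memo_Line"}
ground_truth = invert_mapping(example_rename)
print(ground_truth)
```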
---
## Step 2 — Build the AI mapping prompt

We send the LLM the simulated column names, a few sample rows, and the canonical
schema definition. The LLM must return a JSON mapping.

In [None]:
# Format sample data as a compact table for the prompt
sample_rows = df_simulated.head(5).to_csv(index=False)

canonical_schema_text = "\n".join(
    f"  - {col}: {desc}" for col, desc in CANONICAL_SCHEMA.items()
)

mapping_prompt = f"""You are a data engineering assistant for a banking analytics platform.

A new financial institution has provided a transaction data file. Your task is to map
each column in their file to the corresponding column in our canonical schema.

### Canonical Schema (target columns)
{canonical_schema_text}

### Client's Column Names
{list(df_simulated.columns)}

### Sample Data (first 5 rows)
{sample_rows}

### Instructions
1. Examine each client column name and its sample values.
2. Map each client column to the most appropriate canonical column.
3. If a client column has no match in the canonical schema, map it to null.
4. Every canonical column should be mapped to at most one client column.
5. Return ONLY a JSON object where keys are the client's column names and values
   are the canonical column names (or null if no match).
"""

print(f"Prompt length: {len(mapping_prompt)} chars (~{len(mapping_prompt)//4} tokens)")
print("\n--- Prompt preview (first 500 chars) ---")
print(mapping_prompt[:500])

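If the raw CSV sample makes the prompt too long, one possible alternative is a per-column summary with a few distinct, truncated values. This is only a sketch: `summarize_columns` and its parameters are hypothetical, the rows are made up, and it expects plain dict rows with empty strings for missing values (for example from `df_simulated.fillna("").to_dict("records")`).

```python
def summarize_columns(rows: list[dict[str, str]], max_values: int = 3, max_chars: int = 40) -> str:
    """Render one line per column with a few distinct, truncated sample values."""
    if not rows:
        return ""
    lines = []
    for col in rows[0]:
        seen: list[str] = []
        for row in rows:
            value = (row.get(col) or "").strip()
            if value and value not in seen:
                seen.append(value)
            if len(seen) == max_values:
                break
        # Shorten long values so one wide column cannot dominate the prompt.
        shown = [v if len(v) <= max_chars else v[: max_chars - 3] + "..." for v in seen]
        lines.append(f"  - {col}: {shown if shown else '(all empty)'}")
    return "\n".join(lines)


# Made-up rows for illustration only; not taken from the Bank Plus file.
example_rows = [
    {"Trans_Code": "183", "Trans_Desc": "", "Amount": "25.00"},
    {"Trans_Code": "183", "Trans_Desc": "", "Amount": "110.42"},
]
print(summarize_columns(example_rows))
```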
---
## Step 3 — Call `ai_query()` for column mapping

In [None]:
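# Escape backslashes first, then single quotes, so the prompt survives as a SQL string literal
escaped_prompt = mapping_prompt.replace("\\", "\\\\").replace("'", "''")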

# Build the responseFormat JSON schema: one nullable string property per client column
client_cols = list(df_simulated.columns)
properties = {col: {"type": ["string", "null"]} for col in client_cols}

response_schema = json.dumps({
    "type": "json_schema",
    "json_schema": {
        "name": "column_mapping",
        "schema": {
            "type": "object",
            "properties": properties,
        },
        "strict": True,
    }
})

mapping_query = f"""
SELECT ai_query(
    '{MODEL_NAME}',
    '{escaped_prompt}',
    responseFormat => '{response_schema}'
) as mapping_result
"""

try:
    print("Executing ai_query for column mapping...")
    result_df = spark.sql(mapping_query)
    mapping_raw = result_df.collect()[0]["mapping_result"]
    print(f"\nRaw LLM response:\n{mapping_raw}")
except NameError:
    print("Spark session not found — using mock response for local testing.")
    mapping_raw = json.dumps({
        "Account_Num":       "ACCTNO",
        "Acct_Status":       "status",
        "Trans_Code":        "TRANCD",
        "Trans_Desc":        "description",
        "Transaction_Date":  "TRDATE",
        "Amount":            "AMT",
        "Description_1":     "EFHDS1",
        "Memo_Line":         "EFHDS2",
        "Internal_Acct":     "Account#",
        "Post_Date":         "PostingDate",
    })
    print(f"Mock response:\n{mapping_raw}")

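Embedding the prompt in the SQL text depends on getting the escaping right. A hedged alternative is to pass the prompt through a named parameter marker, which `spark.sql(query, args=...)` supports in PySpark 3.4 and later. In this sketch `sql_string_literal` and `build_mapping_query` are hypothetical helpers, the schema is a placeholder, and it is an assumption (not verified here) that `ai_query()` accepts a parameter marker for its request argument.

```python
import json


def sql_string_literal(text: str) -> str:
    """Quote text as a SQL string literal, backslash-escaping backslashes and quotes.

    Assumes Spark's default literal parsing
    (spark.sql.parser.escapedStringLiterals = false).
    """
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


def build_mapping_query(model_name: str, response_schema: str) -> str:
    """Build the ai_query statement with a :prompt marker instead of inline text."""
    return (
        "SELECT ai_query(\n"
        f"    {sql_string_literal(model_name)},\n"
        "    :prompt,\n"
        f"    responseFormat => {sql_string_literal(response_schema)}\n"
        ") AS mapping_result"
    )


# Placeholder schema for illustration; the notebook builds the real one per client.
example_schema = json.dumps({
    "type": "json_schema",
    "json_schema": {"name": "column_mapping", "schema": {"type": "object"}},
})
query = build_mapping_query("databricks-claude-opus-4-6", example_schema)
print(query)

# In Databricks (assumption: ai_query accepts a parameter marker for the request):
# result_df = spark.sql(query, args={"prompt": mapping_prompt})
```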
---
## Step 4 — Validate the AI mapping

In [None]:
# Parse the LLM response
ai_mapping = json.loads(mapping_raw)

print("AI-proposed mapping:")
for client_col, canonical_col in ai_mapping.items():
    print(f"  {client_col:<20} → {canonical_col}")

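`json.loads` fails if the model wraps its answer in a Markdown code fence, and it accepts any JSON value, not only an object of strings. A minimal sketch of a more defensive parser follows; `parse_mapping_response` is a hypothetical helper, not part of this notebook.

```python
import json
import re
from typing import Optional

FENCE = "`" * 3  # three backticks, built this way to keep the listing readable
FENCED_BLOCK = re.compile(rf"{FENCE}[a-zA-Z]*\s*(.*?)\s*{FENCE}", re.DOTALL)


def parse_mapping_response(raw: str) -> dict[str, Optional[str]]:
    """Parse an LLM reply into a {client_column: canonical_column or None} dict."""
    text = raw.strip()
    # Drop a surrounding Markdown code fence if the whole reply is wrapped in one.
    fenced = FENCED_BLOCK.fullmatch(text)
    if fenced:
        text = fenced.group(1)
    parsed = json.loads(text)
    if not isinstance(parsed, dict):
        raise ValueError(f"Expected a JSON object, got {type(parsed).__name__}")
    for key, value in parsed.items():
        if value is not None and not isinstance(value, str):
            raise ValueError(f"Column {key!r} maps to a non-string value: {value!r}")
    return parsed


# Made-up reply for illustration.
example_reply = FENCE + 'json\n{"Trans_Code": "TRANCD", "Extra_Col": null}\n' + FENCE
print(parse_mapping_response(example_reply))
```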
In [None]:
# Compare AI mapping against the known correct mapping
total = len(CORRECT_MAPPING)
correct = 0
results = []

for client_col, expected_canon in CORRECT_MAPPING.items():
    ai_canon = ai_mapping.get(client_col)
    match = (ai_canon == expected_canon)
    if match:
        correct += 1
    results.append({
        "client_column": client_col,
        "expected": expected_canon,
        "ai_proposed": ai_canon,
        "match": "Y" if match else "X",
    })

accuracy = correct / total * 100

print("=" * 65)
print(f"COLUMN MAPPING VALIDATION — {correct}/{total} correct ({accuracy:.0f}%)")
print("=" * 65)
for r in results:
    status = r['match']
    print(f"  [{status}] {r['client_column']:<20} → AI: {str(r['ai_proposed']):<15} | Expected: {r['expected']}")

if correct == total:
    print("\nAll columns mapped correctly.")
else:
    print(f"\n{total - correct} column(s) mapped incorrectly.")

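For a real client there is no ground truth, so the comparison above is not available. Structural checks can still catch some problems, such as two client columns mapped to the same canonical column, which would produce duplicate column names when the mapping is applied in Step 6. A sketch, with `check_mapping` as a hypothetical helper and a deliberately flawed example mapping:

```python
from collections import Counter
from typing import Optional


def check_mapping(
    mapping: dict[str, Optional[str]],
    client_columns: list[str],
    canonical_columns: list[str],
) -> dict[str, list[str]]:
    """Return lists of structural problems found in a proposed mapping."""
    targets = [t for t in mapping.values() if t is not None]
    counts = Counter(targets)
    return {
        "unknown_client_columns": sorted(set(mapping) - set(client_columns)),
        "missing_client_columns": sorted(set(client_columns) - set(mapping)),
        "unknown_canonical_columns": sorted(set(targets) - set(canonical_columns)),
        "duplicate_targets": sorted(t for t, n in counts.items() if n > 1),
        "unmapped_canonical_columns": sorted(set(canonical_columns) - set(targets)),
    }


# Deliberately flawed mapping, made up for illustration.
problems = check_mapping(
    mapping={"Trans_Code": "TRANCD", "Description_1": "EFHDS1", "Memo_Line": "EFHDS1"},
    client_columns=["Trans_Code", "Description_1", "Memo_Line", "Amount"],
    canonical_columns=["TRANCD", "EFHDS1", "EFHDS2", "AMT"],
)
for name, items in problems.items():
    print(f"{name}: {items}")
```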
---
## Step 5 — Save mapping to Unity Catalog

In [None]:
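from datetime import datetime, timezone

# Use one timezone-aware timestamp for the whole run so every row shares it
# (datetime.utcnow() is deprecated as of Python 3.12)
run_timestamp = datetime.now(timezone.utc).isoformat()

# Build a DataFrame with the mapping result
mapping_records = []
for client_col, canonical_col in ai_mapping.items():
    mapping_records.append({
        "client_name":      "simulated_bank_plus",
        "client_column":    client_col,
        "canonical_column": canonical_col,
        "model_name":       MODEL_NAME,
        "run_timestamp":    run_timestamp,
    })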

df_mapping = pd.DataFrame(mapping_records)

try:
    sdf_mapping = spark.createDataFrame(df_mapping)
    # Note: mode("overwrite") replaces the entire table, including rows saved for other clients
    sdf_mapping.write.mode("overwrite").saveAsTable(MAPPINGS_TABLE)
    print(f"Saved {len(df_mapping)} column mappings to {MAPPINGS_TABLE}")
except NameError:
    print("Spark session not found — skipping UC write (run in Databricks).")
    print(f"DataFrame ready with {len(df_mapping)} rows:")
    print(df_mapping.to_string(index=False))

---
## Step 6 — Apply mapping (demonstration)

Show how the mapping would be applied to rename a client's file
into the canonical schema for downstream processing.

In [None]:
# Apply the AI mapping to rename columns back to canonical
rename_map = {client: canon for client, canon in ai_mapping.items() if canon is not None}
df_renamed = df_simulated.rename(columns=rename_map)

print("Columns after applying AI mapping:")
print(f"  {list(df_renamed.columns)}")

# Verify all canonical columns needed by the pipeline are present
required_cols = ["TRANCD", "EFHDS1", "AMT", "TRDATE", "ACCTNO"]
missing = [c for c in required_cols if c not in df_renamed.columns]

if missing:
    print(f"\nWARNING: Missing required columns after mapping: {missing}")
else:
    print("\nAll required pipeline columns present — ready for categorization.")

df_renamed.head(5)