# 02 — Map Client Schema

**Objective:** Prototype an AI-based column mapping tool that can take a new client's
transaction file — with arbitrary column names — and automatically map each column
to a **generic, standardized schema** that is the same for every client.

### Why this matters

Every financial institution (FI) exports transaction data from its core banking system under different column names.
Bank Plus (Jack Henry SilverLake) uses `TRANCD`, `EFHDS1`, `EFHDS2`, etc. Another
FI on FIS or Fiserv might call them `TxnCode`, `Description`, `Memo`. Before the
categorization pipeline can run, we need all client data normalized to a single
universal format.

### Canonical output columns

| Canonical Column | Description |
|-----------------|-------------|
| `transaction_code` | Numeric transaction code (primary key for categorization) |
| `description_1` | Primary transaction description text |
| `description_2` | Secondary description / memo line |
| `amount` | Transaction amount |
| `transaction_date` | Date the transaction occurred |
| `posting_date` | Date the transaction was posted |
| `account_number` | Account identifier |
| `account_status` | Account status code |
| `internal_account` | Internal/alternate account number |
| `transaction_desc` | Transaction type description (may be empty) |

### Approach

1. Load a sample of Bank Plus NON_POS data (which has client-specific column names like `TRANCD`, `EFHDS1`).
2. Use `ai_query()` to ask the LLM to map each client column to the canonical schema.
3. Validate the result against the known correct mapping.
4. Save the mapping to Unity Catalog for downstream use.
5. Apply the mapping to produce a standardized DataFrame.

**Runs on:** Databricks Runtime 15.4 LTS or above.

In [None]:
# ── Configuration ─────────────────────────────────────────────────
CATALOG_NAME = "ciq-bp_dummy-dev"
SCHEMA_NAME  = "default"
MODEL_NAME   = "databricks-claude-opus-4-6"

RAW_NON_POS_PATH = "../data/bank-plus-data/raw/CheckingIQ_NON_POS_Daily_012626_rerun.csv"

MAPPINGS_TABLE = f"`{CATALOG_NAME}`.`{SCHEMA_NAME}`.client_column_mappings"

# Generic canonical schema — universal column names for ALL clients.
# Every client's data gets mapped to these same column names.
CANONICAL_SCHEMA = {
    "transaction_code":  "Numeric transaction code (primary key for categorization)",
    "description_1":     "Primary transaction description text",
    "description_2":     "Secondary description / memo line",
    "amount":            "Transaction amount",
    "transaction_date":  "Date the transaction occurred",
    "posting_date":      "Date the transaction was posted",
    "account_number":    "Account identifier",
    "account_status":    "Account status code",
    "internal_account":  "Internal/alternate account number",
    "transaction_desc":  "Transaction type description (may be empty)",
}

# For Bank Plus, we know the correct mapping from their columns to canonical.
# This is the ground truth used to validate the AI output.
BANK_PLUS_CORRECT_MAPPING = {
    "TRANCD":      "transaction_code",
    "EFHDS1":      "description_1",
    "EFHDS2":      "description_2",
    "AMT":         "amount",
    "TRDATE":      "transaction_date",
    "PostingDate":  "posting_date",
    "ACCTNO":      "account_number",
    "status":      "account_status",
    "Account#":    "internal_account",
    "description": "transaction_desc",
}

print(f"Model: {MODEL_NAME}")
print(f"Output table: {MAPPINGS_TABLE}")
print(f"Canonical columns: {list(CANONICAL_SCHEMA.keys())}")

---
## Step 1 — Load client data sample

We take the first 20 rows of Bank Plus NON_POS data. These columns (`TRANCD`,
`EFHDS1`, `EFHDS2`, etc.) are client-specific — the AI must map them to our
generic canonical column names.

In [None]:
import pandas as pd
import json

df_original = pd.read_csv(RAW_NON_POS_PATH, nrows=20, dtype=str)
df_original.columns = [c.strip() for c in df_original.columns]

print(f"Original columns: {list(df_original.columns)}")
print(f"Sample rows: {len(df_original)}")
df_original.head(5)

In [None]:
# Bank Plus columns are already "client-specific" — no simulation needed.
# The AI must map these raw column names to the generic canonical schema.
df_client = df_original.copy()

print("Client columns (Bank Plus / Jack Henry SilverLake):")
print(f"  {list(df_client.columns)}")
print("\nKnown correct mapping (ground truth for validation):")
for client_col, canon_col in BANK_PLUS_CORRECT_MAPPING.items():
    print(f"  {client_col:<15} → {canon_col}")

df_client.head(5)

---
## Step 2 — Build the AI mapping prompt

We send the LLM the client's actual column names, a few sample rows, and the
generic canonical schema definition. The LLM must return a JSON object mapping
each client column to a canonical column (or null if no match).

In [None]:
sample_rows = df_client.head(5).to_csv(index=False)

canonical_schema_text = "\n".join(
    f"  - {col}: {desc}" for col, desc in CANONICAL_SCHEMA.items()
)

mapping_prompt = f"""You are a data engineering assistant for a banking analytics platform.

A new financial institution has provided a transaction data file. Your task is to map
each column in their file to the corresponding column in our standardized canonical
schema. The canonical schema uses generic, descriptive names — NOT the client's
original column names.

### Canonical Schema (target columns — these are the OUTPUT names)
{canonical_schema_text}

### Client's Column Names (these are the INPUT names to map FROM)
{list(df_client.columns)}

### Sample Data (first 5 rows)
{sample_rows}

### Instructions
1. Examine each client column name and its sample values.
2. Map each client column to the most appropriate canonical column.
3. If a client column has no match in the canonical schema, map it to null.
4. Every canonical column should be mapped to at most one client column.
5. Return ONLY a JSON object where keys are the client's column names and values
   are the canonical column names (or null if no match).

IMPORTANT: The values in your JSON must be the canonical column names (e.g.
"transaction_code", "description_1", "amount"), NOT the client's original names.
"""

print(f"Prompt length: {len(mapping_prompt)} chars (~{len(mapping_prompt)//4} tokens)")
print("\n--- Prompt preview (first 500 chars) ---")
print(mapping_prompt[:500])

---
## Step 3 — Call `ai_query()` for column mapping

In [None]:
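def sql_literal(text):
    """Escape text for embedding in a single-quoted Spark SQL string literal."""
    # Backslash is the default escape character in Spark SQL string literals,
    # so escape backslashes first, then single quotes.
    return text.replace("\\", "\\\\").replace("'", "\\'")

escaped_prompt = sql_literal(mapping_prompt)

# Build responseFormat: keys = client's column names, values = canonical names or null
client_cols = list(df_client.columns)
properties = {col: {"type": ["string", "null"]} for col in client_cols}

response_schema = json.dumps({
    "type": "json_schema",
    "json_schema": {
        "name": "column_mapping",
        "schema": {
            "type": "object",
            "properties": properties,
        },
        "strict": True,
    }
})
# The schema embeds client column names, so it needs the same escaping as the prompt.
escaped_schema = sql_literal(response_schema)

mapping_query = f"""
SELECT ai_query(
    '{MODEL_NAME}',
    '{escaped_prompt}',
    responseFormat => '{escaped_schema}'
) as mapping_result
"""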

try:
    print("Executing ai_query for column mapping...")
    result_df = spark.sql(mapping_query)
    mapping_raw = result_df.collect()[0]["mapping_result"]
    print(f"\nRaw LLM response:\n{mapping_raw}")
except NameError:
    print("Spark session not found — using mock response for local testing.")
    # Mock: client columns → generic canonical names
    mapping_raw = json.dumps({
        "ACCTNO":      "account_number",
        "status":      "account_status",
        "TRANCD":      "transaction_code",
        "description": "transaction_desc",
        "TRDATE":      "transaction_date",
        "AMT":         "amount",
        "EFHDS1":      "description_1",
        "EFHDS2":      "description_2",
        "Account#":    "internal_account",
        "PostingDate": "posting_date",
    })
    print(f"Mock response:\n{mapping_raw}")

---
## Step 4 — Validate the AI mapping

In [None]:
# Parse the LLM response
ai_mapping = json.loads(mapping_raw)

print("AI-proposed mapping:")
for client_col, canonical_col in ai_mapping.items():
    print(f"  {client_col:<20} → {canonical_col}")
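
With `responseFormat` set, the response should already be bare JSON, but a model can still wrap its answer in a Markdown code fence or add surrounding text. The sketch below shows one defensive way to parse such a response before validation. `parse_mapping_response` is a hypothetical helper, not part of this notebook or of any Databricks API, and the fallback rules are assumptions.

```python
import json
import re

# Three backticks, built indirectly so this example stays fence-safe.
FENCE = "`" * 3
FENCED_BLOCK = re.compile(rf"^{FENCE}[A-Za-z]*\s*(.*?)\s*{FENCE}$", re.DOTALL)


def parse_mapping_response(raw: str) -> dict:
    """Parse a column-mapping response, tolerating code fences or stray text."""
    text = raw.strip()

    # Unwrap a surrounding Markdown code fence if the model added one.
    fenced = FENCED_BLOCK.match(text)
    if fenced:
        text = fenced.group(1)

    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span in the text.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("No JSON object found in model response")
        parsed = json.loads(text[start:end + 1])

    if not isinstance(parsed, dict):
        raise ValueError(f"Expected a JSON object, got {type(parsed).__name__}")
    return parsed
```

If adopted, `ai_mapping = parse_mapping_response(mapping_raw)` could stand in for the plain `json.loads` call.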

In [None]:
# Compare AI mapping against the known correct mapping for Bank Plus
total = len(BANK_PLUS_CORRECT_MAPPING)
correct = 0
results = []

for client_col, expected_canon in BANK_PLUS_CORRECT_MAPPING.items():
    ai_canon = ai_mapping.get(client_col)
    match = (ai_canon == expected_canon)
    if match:
        correct += 1
    results.append({
        "client_column": client_col,
        "expected": expected_canon,
        "ai_proposed": ai_canon,
        "match": "Y" if match else "X",
    })

accuracy = correct / total * 100

print("=" * 70)
print(f"COLUMN MAPPING VALIDATION — {correct}/{total} correct ({accuracy:.0f}%)")
print("=" * 70)
for r in results:
    status = r['match']
    print(f"  [{status}] {r['client_column']:<15} → AI: {str(r['ai_proposed']):<20} | Expected: {r['expected']}")

if accuracy == 100:
    print("\nAll columns mapped correctly to generic canonical names.")
else:
    print(f"\n{total - correct} column(s) mapped incorrectly.")
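
The comparison above depends on a known correct mapping. For a client without one, structural checks can still catch some failures: keys that are not client columns, values that are not canonical names, and canonical columns claimed by more than one client column (which would produce duplicate column names after renaming). A minimal sketch, where `check_mapping_structure` and its message wording are hypothetical:

```python
from collections import Counter


def check_mapping_structure(ai_mapping, client_columns, canonical_columns):
    """Return a list of human-readable problems found in an AI column mapping."""
    problems = []
    client = set(client_columns)
    canonical = set(canonical_columns)

    # Keys must be real client columns, and every client column needs an entry.
    for col in sorted(set(ai_mapping) - client):
        problems.append(f"unknown client column: {col}")
    for col in sorted(client - set(ai_mapping)):
        problems.append(f"client column missing from mapping: {col}")

    # Values must be canonical names or None (None means "no match").
    targets = [t for t in ai_mapping.values() if t is not None]
    for target in sorted(set(targets) - canonical):
        problems.append(f"not a canonical column: {target}")

    # Each canonical column may be claimed by at most one client column.
    counts = Counter(targets)
    for target in sorted(t for t, n in counts.items() if n > 1):
        problems.append(f"canonical column mapped more than once: {target}")

    return problems


# Illustrative input with one problem of each kind.
issues = check_mapping_structure(
    {"AMT": "amount", "TRDATE": "amount", "EFHDS1": "descr", "Extra": None},
    client_columns=["AMT", "TRDATE", "EFHDS1", "ACCTNO"],
    canonical_columns=["amount", "transaction_date", "description_1", "account_number"],
)
for issue in issues:
    print(issue)
```

An empty list means the mapping is structurally sound. It does not mean the mapping is correct.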

---
## Step 5 — Save mapping to Unity Catalog

In [None]:
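from datetime import datetime, timezone

# Use one timezone-aware UTC timestamp for the whole run so every row matches.
# (datetime.utcnow() is deprecated as of Python 3.12.)
run_timestamp = datetime.now(timezone.utc).isoformat()

# Build a DataFrame with the mapping result
mapping_records = []
for client_col, canonical_col in ai_mapping.items():
    mapping_records.append({
        "client_name":      "simulated_bank_plus",
        "client_column":    client_col,
        "canonical_column": canonical_col,
        "model_name":       MODEL_NAME,
        "run_timestamp":    run_timestamp,
    })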

df_mapping = pd.DataFrame(mapping_records)

try:
    sdf_mapping = spark.createDataFrame(df_mapping)
    sdf_mapping.write.mode("overwrite").saveAsTable(MAPPINGS_TABLE)
    print(f"Saved {len(df_mapping)} column mappings to {MAPPINGS_TABLE}")
except NameError:
    print("Spark session not found — skipping UC write (run in Databricks).")
    print(f"DataFrame ready with {len(df_mapping)} rows:")
    print(df_mapping.to_string(index=False))

---
## Step 6 — Apply mapping (demonstration)

Show how the mapping renames client-specific columns to the generic
canonical names. After this step, every client's data looks the same.
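
Renaming alone leaves any unmapped client columns in place and keeps the client's column order. If downstream code expects exactly the canonical columns in a fixed order, a projection step is one option. The sketch below uses plain dictionaries instead of pandas so that it runs anywhere. `to_canonical_rows` is a hypothetical helper, and filling absent canonical columns with `None` is an assumption. In pandas, `DataFrame.reindex(columns=...)` after the rename would play a similar role.

```python
def to_canonical_rows(rows, mapping, canonical_columns):
    """Project client rows onto the canonical columns, in canonical order."""
    # Invert the mapping: canonical name -> client column (skip unmapped columns).
    source_for = {
        canon: client for client, canon in mapping.items() if canon is not None
    }
    return [
        {
            # Canonical columns with no client source are filled with None.
            canon: row.get(source_for[canon]) if canon in source_for else None
            for canon in canonical_columns
        }
        for row in rows
    ]


# Illustrative data: one mapped pair, one unmapped client column,
# and one canonical column that has no source in this file.
rows = [{"TRANCD": "101", "AMT": "12.50", "FILLER": "x"}]
mapping = {"TRANCD": "transaction_code", "AMT": "amount", "FILLER": None}
canonical = ["transaction_code", "description_1", "amount"]

result = to_canonical_rows(rows, mapping, canonical)
print(result)
```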

In [None]:
# Apply the AI mapping to rename client columns → generic canonical names
rename_map = {client: canon for client, canon in ai_mapping.items() if canon is not None}
df_canonical = df_client.rename(columns=rename_map)

print("Columns after applying AI mapping:")
print(f"  {list(df_canonical.columns)}")

# Verify all canonical columns needed by the categorization pipeline are present
required_cols = ["transaction_code", "description_1", "amount", "transaction_date", "account_number"]
missing = [c for c in required_cols if c not in df_canonical.columns]

if missing:
    print(f"\nWARNING: Missing required columns after mapping: {missing}")
else:
    print("\nAll required pipeline columns present — ready for categorization.")

df_canonical.head(5)