# 02 — Map Client Schema (Prototype)

**Objective:** Prototype an AI-based column mapping tool.

**Note:** This logic has been integrated into `01_prepare_data.ipynb` as Section B.0.
This notebook remains as a standalone tool for testing and refining the mapping
prompt independently of the full data preparation pipeline.

### Why this matters

Every financial institution's (FI's) core banking system exports transaction data under its own column names.
Bank Plus (Jack Henry SilverLake) uses `TRANCD`, `EFHDS1`, `EFHDS2`, etc. Another
FI on FIS or Fiserv might call them `TxnCode`, `Description`, `Memo`. Before the
categorization pipeline can run, we need all client data normalized to a single
universal format.
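
As a minimal illustration of that normalization, the sketch below renames two differently shaped records to the same canonical keys. The Fiserv-style column names follow the few-shot example in the mapping prompt, the sample values are made up, and `to_canonical` is a hypothetical helper, not part of the pipeline.

```python
# Hypothetical per-client mappings: client column -> canonical column.
BANK_PLUS_MAP = {"TRANCD": "transaction_code", "EFHDS1": "description_1", "AMT": "amount"}
FISERV_STYLE_MAP = {"TxnCode": "transaction_code", "Desc1": "description_1", "TxnAmt": "amount"}


def to_canonical(row, mapping):
    """Rename the keys of one record; unmapped columns are dropped."""
    return {mapping[col]: value for col, value in row.items() if col in mapping}


# Made-up sample rows, for illustration only.
bank_plus_row = {"TRANCD": "183", "EFHDS1": "ACH DEPOSIT", "AMT": "258.20"}
fiserv_style_row = {"TxnCode": "183", "Desc1": "ACH DEPOSIT", "TxnAmt": "258.20"}

canonical_a = to_canonical(bank_plus_row, BANK_PLUS_MAP)
canonical_b = to_canonical(fiserv_style_row, FISERV_STYLE_MAP)
print(canonical_a == canonical_b)
```

Once both records share the same keys, downstream code no longer needs to know which core system produced them.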

### Canonical output columns

| Canonical Column | Description |
|-----------------|-------------|
| `transaction_code` | Numeric transaction code (primary key for categorization) |
| `description_1` | Primary transaction description text |
| `description_2` | Secondary description / memo line |
| `amount` | Transaction amount |
| `transaction_date` | Date the transaction occurred |
| `posting_date` | Date the transaction was posted |
| `account_number` | Account identifier |
| `account_status` | Account status code |
| `internal_account` | Internal/alternate account number |
| `transaction_desc` | Transaction type description (may be empty) |

### Approach

1. Take Bank Plus NON_POS data (which has client-specific column names like `TRANCD`, `EFHDS1`).
2. Use `ai_query()` to ask the LLM to map each client column to the canonical schema.
3. Validate the AI output against the known correct mapping.
4. Save the mapping to Unity Catalog for downstream use.
5. Apply the mapping to produce a standardized DataFrame.

**Runs on:** Databricks Runtime 15.4 LTS or above.

In [0]:
# ── Configuration ─────────────────────────────────────────────────
CATALOG_NAME = "ciq-bp_dummy-dev"
SCHEMA_NAME  = "default"
MODEL_NAME   = "databricks-claude-opus-4-6"

RAW_NON_POS_PATH = "../data/bank-plus-data/raw/CheckingIQ_NON_POS_Daily_012626_rerun.csv"

MAPPINGS_TABLE = f"`{CATALOG_NAME}`.`{SCHEMA_NAME}`.client_column_mappings"

# Generic canonical schema — universal column names for ALL clients.
# Every client's data gets mapped to these same column names.
CANONICAL_SCHEMA = {
    "transaction_code":  "Numeric transaction code (primary key for categorization)",
    "description_1":     "Primary transaction description text",
    "description_2":     "Secondary description / memo line",
    "amount":            "Transaction amount",
    "transaction_date":  "Date the transaction occurred",
    "posting_date":      "Date the transaction was posted",
    "account_number":    "Account identifier",
    "account_status":    "Account status code",
    "internal_account":  "Internal/alternate account number",
    "transaction_desc":  "Transaction type description (may be empty)",
}

# For Bank Plus, we know the correct mapping from their columns to canonical.
# This is the ground truth used to validate the AI output.
BANK_PLUS_CORRECT_MAPPING = {
    "TRANCD":      "transaction_code",
    "EFHDS1":      "description_1",
    "EFHDS2":      "description_2",
    "AMT":         "amount",
    "TRDATE":      "transaction_date",
    "PostingDate": "posting_date",
    "ACCTNO":      "account_number",
    "status":      "account_status",
    "Account#":    "internal_account",
    "description": "transaction_desc",
}

print(f"Model: {MODEL_NAME}")
print(f"Output table: {MAPPINGS_TABLE}")
print(f"Canonical columns: {list(CANONICAL_SCHEMA.keys())}")

---
## Step 1 — Load client data sample

We take the first 20 rows of Bank Plus NON_POS data. These columns (`TRANCD`,
`EFHDS1`, `EFHDS2`, etc.) are client-specific — the AI must map them to our
generic canonical column names.

In [0]:
import pandas as pd
import json
import csv

def load_raw_csv_sample(path, n_rows=20):
    """
    Robust CSV loader that handles unquoted commas in EFHDS1.
    Parses from both ends: the front 6 columns and the tail columns are safe.
    Surplus middle fields are assumed to belong to EFHDS1 and are merged back
    into it; the last middle field is kept as EFHDS2.
    Rows with fewer fields than the header are skipped.
    """
    n_front = 6                            # ACCTNO → AMT
    rows = []
    # newline="" is what the csv module expects for file objects.
    with open(path, "r", newline="") as f:
        reader = csv.reader(f)
        header = [c.strip() for c in next(reader)]
        n_expected = len(header)
        n_tail = n_expected - n_front - 2  # Account# → end

        for fields in reader:
            if len(rows) >= n_rows:
                break
            n = len(fields)
            if n == n_expected:
                rows.append(fields)
            elif n > n_expected:
                front  = fields[:n_front]
                tail   = fields[n - n_tail:]
                middle = fields[n_front : n - n_tail]
                efhds1 = ",".join(middle[:-1])
                efhds2 = middle[-1]
                rows.append(front + [efhds1, efhds2] + tail)
            # n < n_expected: malformed row, skipped

    return pd.DataFrame(rows, columns=header)


df_original = load_raw_csv_sample(RAW_NON_POS_PATH, n_rows=20)
df_original["TRANCD"] = df_original["TRANCD"].astype(str).str.strip()

print(f"Original columns: {list(df_original.columns)}")
print(f"Sample rows: {len(df_original)}")

# Quick sanity check — values should make sense under each column
print(f"\n  ACCTNO samples:  {df_original['ACCTNO'].head(3).tolist()}")
print(f"  TRANCD samples:  {df_original['TRANCD'].head(3).tolist()}")
print(f"  AMT samples:     {df_original['AMT'].head(3).tolist()}")
print(f"  EFHDS1 samples:  {df_original['EFHDS1'].str.strip().head(3).tolist()}")

df_original.head(5)
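
The "parse from both ends" repair can be tried without the raw file. The sketch below runs the same idea on an in-memory sample; the column order, the sample rows, and the `repair_row` helper are assumptions made for illustration, and unlike the loader above it raises on short rows instead of skipping them.

```python
import csv
import io

# Hypothetical export: the second data row has an unquoted comma inside EFHDS1.
SAMPLE = (
    "ACCTNO,status,TRANCD,description,TRDATE,AMT,EFHDS1,EFHDS2,Account#,PostingDate\n"
    "A1,1,183,,2026-01-26,258.20,ACH DEPOSIT,PAYROLL,1001,2026-01-26\n"
    "A2,1,163,,2026-01-26,12.00,SMITH, JOHN,TRANSFER,1002,2026-01-26\n"
)

N_FRONT = 6   # ACCTNO .. AMT (assumed column order)
N_MIDDLE = 2  # EFHDS1, EFHDS2


def repair_row(fields, n_expected):
    """Merge surplus middle fields into EFHDS1, keeping the last one as EFHDS2."""
    if len(fields) == n_expected:
        return fields
    if len(fields) < n_expected:
        raise ValueError(f"row has too few fields: {fields}")
    n_tail = n_expected - N_FRONT - N_MIDDLE
    tail_start = len(fields) - n_tail
    middle = fields[N_FRONT:tail_start]
    return fields[:N_FRONT] + [",".join(middle[:-1]), middle[-1]] + fields[tail_start:]


reader = csv.reader(io.StringIO(SAMPLE))
header = next(reader)
repaired = [repair_row(fields, len(header)) for fields in reader]
for row in repaired:
    print(dict(zip(header, row)))
```

The heuristic attributes every surplus field to EFHDS1, so a stray comma inside EFHDS2 would still be misassigned; quoting the fields at export time would remove the ambiguity.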

In [0]:
# Bank Plus columns are already "client-specific" — no simulation needed.
# The AI must map these raw column names to the generic canonical schema.
df_client = df_original.copy()

print("Client columns (Bank Plus / Jack Henry SilverLake):")
print(f"  {list(df_client.columns)}")
print("\nKnown correct mapping (ground truth for validation):")
for client_col, canon_col in BANK_PLUS_CORRECT_MAPPING.items():
    print(f"  {client_col:<15} → {canon_col}")

df_client.head(5)

---
## Step 2 — Build the AI mapping prompt

We send the LLM the client's actual column names, a few sample rows, and the
generic canonical schema definition. The LLM must return a JSON object mapping
each client column to a canonical column (or null if no match).

In [0]:
df_client = df_original.copy()

# Build a rich column profile — name + inferred type + sample values
column_profiles = []
for col in df_client.columns:
    vals = df_client[col].dropna().astype(str).str.strip()
    vals = vals[vals != ""].head(5).tolist()

    # Simple type inference
    if not vals:
        # all() is True for an empty list, so an empty column would otherwise
        # be labeled "small integers". Handle it explicitly.
        dtype = "empty (no sample values)"
    elif all(v.replace(".", "").replace("-", "").isdigit() for v in vals):
        if any("." in v for v in vals):
            dtype = "decimal numbers"
        elif all(len(v) <= 3 for v in vals):
            dtype = "small integers"
        else:
            dtype = "integers"
        # Check if looks like dates
        if any("-" in v and len(v) == 10 for v in vals):
            dtype = "dates (YYYY-MM-DD)"
    elif all("-" in v and len(v) == 10 for v in vals):
        dtype = "dates (YYYY-MM-DD)"
    else:
        dtype = "text"

    column_profiles.append(f"  - Column: \"{col}\"\n    Data type: {dtype}\n    Sample values: {vals}")

column_profiles_text = "\n".join(column_profiles)

canonical_schema_text = "\n".join(
    f"  - {col}: {desc}" for col, desc in CANONICAL_SCHEMA.items()
)

mapping_prompt = f"""You are a data engineering expert specializing in US banking core systems.

A financial institution has provided a transaction data export. Their column names
are abbreviations from their core banking system (e.g., Jack Henry SilverLake,
Fiserv, FIS). Your task is to map each column to our canonical schema.

### Canonical Schema (target — map TO these names)
{canonical_schema_text}

### Client Data Profile (source — map FROM these columns)
{column_profiles_text}

### Key Banking Domain Knowledge
- Core systems use abbreviations: TRANCD = transaction code, EFHDS = extended
  field header/description, AMT = amount, ACCTNO = account number, etc.
- "EFHDS1" and "EFHDS2" are Jack Henry's names for description lines 1 and 2.
- Transaction files often have TWO date columns: a transaction date and a posting
  date. They may look similar but serve different purposes.
- "status" is typically a small integer (1, 2, 3) indicating account status.
- "description" (if mostly empty) is a legacy field for transaction type description.
- There may be TWO account number columns: an external-facing one and an internal one.

### Few-Shot Example (different bank, similar task)
A Fiserv bank had these columns: TxnCode, Desc1, Desc2, TxnAmt, TxnDate,
PostDate, AcctNum, AcctStat, IntAcct, TxnDesc.

Correct mapping:
  TxnCode  → transaction_code
  Desc1    → description_1
  Desc2    → description_2
  TxnAmt   → amount
  TxnDate  → transaction_date
  PostDate → posting_date
  AcctNum  → account_number
  AcctStat → account_status
  IntAcct  → internal_account
  TxnDesc  → transaction_desc

### Instructions
1. MATCH BY DATA VALUES, not just column names. Look at the sample values to
   determine what each column actually contains.
2. Map each client column to exactly one canonical column, or null if no match.
3. Each canonical column can only be used once.
4. Think step by step:
   - Which column has small integers (like 183, 163)? → transaction_code
   - Which column has dollar amounts (like 258.20)? → amount
   - Which column has text descriptions? → description_1 (primary), description_2 (secondary)
   - Which columns have dates? → the one named for transaction date, the other for posting date
   - Which column has alphanumeric account IDs? → account_number
   - Which column has numeric-only account numbers? → internal_account
   - Which column has single-digit status codes? → account_status
   - Which column is mostly empty text? → transaction_desc

Return ONLY a JSON object. Keys = client column names, values = canonical column names (or null).
"""

print(f"Prompt length: {len(mapping_prompt)} chars (~{len(mapping_prompt)//4} tokens)")
print("\n--- Column profiles sent to LLM ---")
print(column_profiles_text)
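
The inline type inference is easier to test when pulled into a function. The sketch below is one possible stricter variant, not the logic used above: it matches whole values with regular expressions, so a string such as `1-2-3` is treated as text rather than as a number. The `infer_dtype` name and the label for empty columns are assumptions.

```python
import re

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
NUMBER_RE = re.compile(r"-?\d+(\.\d+)?")


def infer_dtype(vals):
    """Return a coarse type label for a list of non-empty sample strings."""
    if not vals:
        return "empty (no sample values)"
    if all(DATE_RE.fullmatch(v) for v in vals):
        return "dates (YYYY-MM-DD)"
    if all(NUMBER_RE.fullmatch(v) for v in vals):
        if any("." in v for v in vals):
            return "decimal numbers"
        if all(len(v) <= 3 for v in vals):
            return "small integers"
        return "integers"
    return "text"


# Made-up sample values, for illustration only.
print(infer_dtype(["183", "163"]))
print(infer_dtype(["258.20", "-12.00"]))
print(infer_dtype(["2026-01-26"]))
print(infer_dtype([]))
print(infer_dtype(["ACH DEPOSIT"]))
```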

---
## Step 3 — Call `ai_query()` for column mapping

In [0]:
import re

escaped_prompt = mapping_prompt.replace("'", "''")

# Build responseFormat: keys = client's column names, values = canonical names or null
client_cols = list(df_client.columns)

# Sanitize column names for JSON schema (replace invalid chars with underscore)
# Pattern: ^[a-zA-Z0-9_.-]{1,64}$
def sanitize_column_name(col_name):
    return re.sub(r'[^a-zA-Z0-9_.-]', '_', col_name)

sanitized_to_original = {sanitize_column_name(col): col for col in client_cols}
properties = {sanitize_column_name(col): {"type": ["string", "null"]} for col in client_cols}

response_schema = json.dumps({
    "type": "json_schema",
    "json_schema": {
        "name": "column_mapping",
        "schema": {
            "type": "object",
            "properties": properties,
        },
        "strict": True,
    }
})

mapping_query = f"""
SELECT ai_query(
    '{MODEL_NAME}',
    '{escaped_prompt}',
    responseFormat => '{response_schema}'
) as mapping_result
"""

try:
    print("Executing ai_query for column mapping...")
    result_df = spark.sql(mapping_query)
    mapping_raw_sanitized = result_df.collect()[0]["mapping_result"]
    
    # Convert sanitized keys back to original column names
    mapping_sanitized = json.loads(mapping_raw_sanitized)
    mapping_raw = json.dumps({
        sanitized_to_original[k]: v for k, v in mapping_sanitized.items()
    })
    
    print(f"\nRaw LLM response:\n{mapping_raw}")
except NameError:
    print("Spark session not found — using mock response for local testing.")
    # Mock: client columns → generic canonical names
    mapping_raw = json.dumps({
        "ACCTNO":      "account_number",
        "status":      "account_status",
        "TRANCD":      "transaction_code",
        "description": "transaction_desc",
        "TRDATE":      "transaction_date",
        "AMT":         "amount",
        "EFHDS1":      "description_1",
        "EFHDS2":      "description_2",
        "Account#":    "internal_account",
        "PostingDate": "posting_date",
    })
    print(f"Mock response:\n{mapping_raw}")
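
Sanitizing column names for the JSON schema can make two different client columns collapse onto the same key, in which case `sanitized_to_original` would silently keep only one of them. The sketch below shows a possible pre-flight check; `find_sanitize_collisions` is a hypothetical helper and the header is made up.

```python
import re
from collections import defaultdict


def sanitize_column_name(col_name):
    """Same substitution as in the cell above: invalid characters become underscores."""
    return re.sub(r"[^a-zA-Z0-9_.-]", "_", col_name)


def find_sanitize_collisions(columns):
    """Group original names by sanitized key; return only groups with several members."""
    groups = defaultdict(list)
    for col in columns:
        groups[sanitize_column_name(col)].append(col)
    return {key: names for key, names in groups.items() if len(names) > 1}


# Hypothetical header in which two names differ only by a special character.
collisions = find_sanitize_collisions(["Account#", "Account_", "AMT"])
print(collisions)
```

An empty result means the round trip from sanitized keys back to the original names is unambiguous. The check does not cover the 64-character limit noted in the pattern comment.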

---
## Step 4 — Validate the AI mapping

In [0]:
# Parse the LLM response
ai_mapping = json.loads(mapping_raw)

print("AI-proposed mapping:")
for client_col, canonical_col in ai_mapping.items():
    print(f"  {client_col:<20} → {canonical_col}")

In [0]:
# Compare AI mapping against the known correct mapping for Bank Plus
total = len(BANK_PLUS_CORRECT_MAPPING)
correct = 0
results = []

for client_col, expected_canon in BANK_PLUS_CORRECT_MAPPING.items():
    ai_canon = ai_mapping.get(client_col)
    match = (ai_canon == expected_canon)
    if match:
        correct += 1
    results.append({
        "client_column": client_col,
        "expected": expected_canon,
        "ai_proposed": ai_canon,
        "match": "Y" if match else "X",
    })

accuracy = correct / total * 100

print("=" * 70)
print(f"COLUMN MAPPING VALIDATION — {correct}/{total} correct ({accuracy:.0f}%)")
print("=" * 70)
for r in results:
    status = r['match']
    print(f"  [{status}] {r['client_column']:<15} → AI: {str(r['ai_proposed']):<20} | Expected: {r['expected']}")

if correct == total:
    print("\nAll columns mapped correctly to generic canonical names.")
else:
    print(f"\n{total - correct} column(s) mapped incorrectly.")
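
The comparison above needs a ground-truth mapping, which will not exist for a new client. A structural check can still catch violations of the prompt's rules, such as an unknown target name or a canonical column used twice. The sketch below is one way to do that; `check_mapping`, the shortened `CANONICAL` set, and both sample mappings are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical subset of the canonical schema, kept short for the example.
CANONICAL = {"transaction_code", "description_1", "amount", "transaction_date", "account_number"}


def check_mapping(mapping, canonical=CANONICAL):
    """Return a list of problems; an empty list means the mapping is well formed."""
    problems = []
    targets = [v for v in mapping.values() if v is not None]
    for target in sorted(set(targets) - canonical):
        problems.append(f"unknown canonical column: {target}")
    for target, count in sorted(Counter(targets).items()):
        if count > 1:
            problems.append(f"canonical column used {count} times: {target}")
    for target in sorted(canonical - set(targets)):
        problems.append(f"canonical column not mapped: {target}")
    return problems


good = {"TRANCD": "transaction_code", "EFHDS1": "description_1", "AMT": "amount",
        "TRDATE": "transaction_date", "ACCTNO": "account_number", "extra": None}
bad = {"TRANCD": "transaction_code", "EFHDS1": "amount", "AMT": "amount",
       "TRDATE": "txn_date", "ACCTNO": "account_number"}

print(check_mapping(good))
print(check_mapping(bad))
```

Running a check like this before the rename step would also prevent duplicate column names in the standardized DataFrame.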

---
## Step 5 — Save mapping to Unity Catalog

In [0]:
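from datetime import datetime, timezone

# Stamp every row of this run with the same timezone-aware UTC timestamp
# (datetime.utcnow() is deprecated as of Python 3.12).
run_timestamp = datetime.now(timezone.utc).isoformat()

# Build a DataFrame with the mapping result
mapping_records = []
for client_col, canonical_col in ai_mapping.items():
    mapping_records.append({
        "client_name":      "simulated_bank_plus",
        "client_column":    client_col,
        "canonical_column": canonical_col,
        "model_name":       MODEL_NAME,
        "run_timestamp":    run_timestamp,
    })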

df_mapping = pd.DataFrame(mapping_records)

try:
    sdf_mapping = spark.createDataFrame(df_mapping)
    sdf_mapping.write.mode("overwrite").saveAsTable(MAPPINGS_TABLE)
    print(f"Saved {len(df_mapping)} column mappings to {MAPPINGS_TABLE}")
except NameError:
    print("Spark session not found — skipping UC write (run in Databricks).")
    print(f"DataFrame ready with {len(df_mapping)} rows:")
    print(df_mapping.to_string(index=False))

---
## Step 6 — Apply mapping (demonstration)

Show how the mapping renames client-specific columns to the generic
canonical names. After this step, every client's data looks the same.

In [0]:
# Apply the AI mapping to rename client columns → generic canonical names
rename_map = {client: canon for client, canon in ai_mapping.items() if canon is not None}
df_canonical = df_client.rename(columns=rename_map)

print("Columns after applying AI mapping:")
print(f"  {list(df_canonical.columns)}")

# Verify all canonical columns needed by the categorization pipeline are present
required_cols = ["transaction_code", "description_1", "amount", "transaction_date", "account_number"]
missing = [c for c in required_cols if c not in df_canonical.columns]

if missing:
    print(f"\nWARNING: Missing required columns after mapping: {missing}")
else:
    print("\nAll required pipeline columns present — ready for categorization.")

df_canonical.head(5)