# LLM-Powered Column Mapping Demo

This notebook shows how to normalize free-form values in a Spark DataFrame with `map_column_with_llm`. The helper can run in dry-run mode (no external calls) or call OpenAI/Azure OpenAI with retries, caching, and usage tracking.
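
The helper's internals are not shown in this notebook. As a rough sketch of what retries and caching usually involve, the block below wraps a lookup function with a per-value cache and exponential backoff. `cached_call_with_retry`, `flaky_lookup`, and the delay parameters are hypothetical names chosen for illustration; they are not part of `spark_fuse`.

```python
import time
from typing import Callable, Dict, Optional


def cached_call_with_retry(
    value: str,
    lookup: Callable[[str], Optional[str]],
    cache: Dict[str, Optional[str]],
    max_retries: int = 3,
    base_delay: float = 0.5,
) -> Optional[str]:
    """Return lookup(value), reusing cached answers and retrying on failure."""
    if value in cache:
        return cache[value]  # repeated values never trigger a second call
    for attempt in range(max_retries + 1):
        try:
            result = lookup(value)
            break
        except Exception:
            if attempt == max_retries:
                raise  # give up after the final attempt
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    cache[value] = result
    return result


# Hypothetical stand-in for an LLM call: fails once, then succeeds.
calls = {"count": 0}


def flaky_lookup(value: str) -> Optional[str]:
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("simulated transient error")
    return "Microsoft" if value == "Micro Soft" else None


cache: Dict[str, Optional[str]] = {}
first = cached_call_with_retry("Micro Soft", flaky_lookup, cache, base_delay=0.01)
second = cached_call_with_retry("Micro Soft", flaky_lookup, cache, base_delay=0.01)
print(first, second, calls["count"])
```

Caching by distinct value matters for column mapping because a column often repeats the same raw string many times, so each distinct string only needs one request.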

In [None]:
# Databricks notebooks normally provide a SparkSession named `spark`.
# This fallback ensures local execution still works.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Keep shuffles small; the demo DataFrame has only a handful of rows.
spark.conf.set("spark.sql.shuffle.partitions", "8")

In [None]:
import os
import sys

repo_root = os.path.abspath('..')
src_path = os.path.join(repo_root, 'src')
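# Make the repository's `src` directory importable (assumes this notebook sits one level below the repo root).
if src_path not in sys.path:
    sys.path.insert(0, src_path)
    print('Added to sys.path:', src_path)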


## Configure LLM Credentials

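Enter your credentials in the widgets below. On Databricks, the widget values are copied into the environment variables that the mapping helper reads. If you also provide a secret scope and secret key names, any value found in Databricks Secrets takes precedence over the widget text. Outside Databricks, the cell reads the same environment variables directly.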
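
The precedence applied by the next cell (a secret, when one is configured and readable, overrides the widget text) can be tried without Databricks. In this sketch `resolve_value`, `fake_get_secret`, and the sample scope and key names are hypothetical stand-ins; only the control flow mirrors the notebook's `_resolve` helper.

```python
from typing import Callable


def resolve_value(
    plain: str,
    scope: str,
    secret_name: str,
    get_secret: Callable[[str, str], str],
) -> str:
    """Prefer a secret when scope and key are both given; else keep the plain value."""
    value = plain.strip()
    if scope and secret_name:
        try:
            secret_value = get_secret(scope, secret_name)
            if secret_value:
                value = secret_value
        except Exception as exc:
            # A failed lookup falls back to the plain value instead of aborting.
            print(f"Warning: unable to read secret '{secret_name}' from scope '{scope}': {exc}")
    return value


# Hypothetical in-memory secret store standing in for dbutils.secrets.get.
fake_store = {("llm", "openai-key"): "sk-from-secret"}


def fake_get_secret(scope: str, key: str) -> str:
    return fake_store[(scope, key)]  # raises KeyError for unknown keys


from_secret = resolve_value("sk-typed", "llm", "openai-key", fake_get_secret)
no_scope = resolve_value("sk-typed", "", "openai-key", fake_get_secret)
missing = resolve_value("sk-typed", "llm", "no-such-key", fake_get_secret)
print(from_secret, no_scope, missing)
```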

In [None]:
import os

if "dbutils" in globals():
    dbutils.widgets.removeAll()
    dbutils.widgets.text("OPENAI_API_KEY", "", "OpenAI API Key")
    dbutils.widgets.text("AZURE_OPENAI_ENDPOINT", "", "Azure OpenAI Endpoint (optional)")
    dbutils.widgets.text("AZURE_OPENAI_KEY", "", "Azure OpenAI Key (optional)")
    dbutils.widgets.text("AZURE_OPENAI_API_VERSION", "2023-05-15", "Azure OpenAI API Version")
    dbutils.widgets.text("LLM_SECRET_SCOPE", "", "Secret Scope (optional)")
    dbutils.widgets.text("SECRET_OPENAI_API_KEY", "", "Secret Key: OpenAI API Key")
    dbutils.widgets.text("SECRET_AZURE_ENDPOINT", "", "Secret Key: Azure Endpoint")
    dbutils.widgets.text("SECRET_AZURE_API_KEY", "", "Secret Key: Azure API Key")
    dbutils.widgets.text("SECRET_AZURE_API_VERSION", "", "Secret Key: Azure API Version")

    def _widget(name: str) -> str:
        return dbutils.widgets.get(name).strip()

    scope = _widget("LLM_SECRET_SCOPE")

    def _resolve(widget_name: str, secret_widget: str) -> str:
        value = _widget(widget_name)
        secret_name = _widget(secret_widget) if scope else ""
        if scope and secret_name:
            try:
                secret_value = dbutils.secrets.get(scope=scope, key=secret_name)
                if secret_value:
                    value = secret_value
            except Exception as exc:  # noqa: BLE001
                print(f"Warning: unable to read secret '{secret_name}' from scope '{scope}': {exc}")
        return value

    openai_key = _resolve("OPENAI_API_KEY", "SECRET_OPENAI_API_KEY")
    azure_endpoint = _resolve("AZURE_OPENAI_ENDPOINT", "SECRET_AZURE_ENDPOINT")
    azure_key = _resolve("AZURE_OPENAI_KEY", "SECRET_AZURE_API_KEY")
    azure_version = _resolve("AZURE_OPENAI_API_VERSION", "SECRET_AZURE_API_VERSION") or "2023-05-15"
else:
    openai_key = os.environ.get("OPENAI_API_KEY", "").strip()
    azure_endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT", "").strip()
    azure_key = os.environ.get("AZURE_OPENAI_KEY", "").strip()
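    # Fall back to the default when the variable is unset or empty, matching the Databricks branch.
    azure_version = os.environ.get("AZURE_OPENAI_API_VERSION", "").strip() or "2023-05-15"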

if openai_key:
    os.environ["OPENAI_API_KEY"] = openai_key
if azure_endpoint:
    os.environ["AZURE_OPENAI_ENDPOINT"] = azure_endpoint
if azure_key:
    os.environ["AZURE_OPENAI_KEY"] = azure_key
if azure_version:
    os.environ["AZURE_OPENAI_API_VERSION"] = azure_version


## Build a Sample DataFrame

In [None]:
from pyspark.sql import Row

sample_data = [
    Row(id=1, company="OpenAI Inc."),
    Row(id=2, company="Alphabet"),
    Row(id=3, company="Micro Soft"),
    Row(id=4, company="amazon.com"),
    Row(id=5, company=None),
]

df = spark.createDataFrame(sample_data)
display(df)

## Run Dry-Run Mapping

Dry-run mode performs exact case-insensitive matching without calling an LLM. Use this to estimate how many rows already match your target list.
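
To get a feel for what exact case-insensitive matching will and will not catch, here is a plain-Python sketch applied to the sample values. `exact_match` is a hypothetical function written for this illustration, and the helper's real implementation may differ (for example in how it treats whitespace).

```python
from typing import List, Optional, Sequence


def exact_match(value: Optional[str], targets: Sequence[str]) -> Optional[str]:
    """Return the target equal to `value` ignoring case, or None if there is none."""
    if value is None:
        return None
    lookup = {target.casefold(): target for target in targets}
    return lookup.get(value.casefold())


targets = ["OpenAI", "Alphabet", "Microsoft", "Amazon"]
samples: List[Optional[str]] = ["OpenAI Inc.", "Alphabet", "Micro Soft", "amazon.com", None]

results = [exact_match(sample, targets) for sample in samples]
for sample, result in zip(samples, results):
    print(f"{sample!r} -> {result!r}")
```

Under these assumptions only `"Alphabet"` matches; variants such as `"Micro Soft"` or `"amazon.com"` need fuzzy matching.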

In [None]:
from spark_fuse.utils.transformations import map_column_with_llm

target_companies = ["OpenAI", "Alphabet", "Microsoft", "Amazon"]

dry_run_df = map_column_with_llm(
    df,
    column="company",
    target_values=target_companies,
    dry_run=True,
    temperature=0.0,
)
display(dry_run_df)


## Run Live Mapping

With credentials configured, set `dry_run=False` to call the LLM for fuzzy matching.
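
A common safeguard when an LLM picks from a fixed list is to accept its reply only if it names one of the allowed targets, and to treat anything else as unmapped. Whether `map_column_with_llm` does this is not shown in this notebook; `validate_reply` below is a hypothetical sketch of the idea.

```python
from typing import Optional, Sequence


def validate_reply(reply: Optional[str], targets: Sequence[str]) -> Optional[str]:
    """Return the canonical target named by `reply`, or None if it is not allowed."""
    if not reply:
        return None
    # Models sometimes wrap answers in quotes or whitespace; remove both before comparing.
    cleaned = reply.strip().strip("\"'").casefold()
    for target in targets:
        if target.casefold() == cleaned:
            return target
    return None


targets = ["OpenAI", "Alphabet", "Microsoft", "Amazon"]

accepted = validate_reply(' "microsoft" ', targets)
rejected = validate_reply("Meta", targets)
empty = validate_reply(None, targets)
print(accepted, rejected, empty)
```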

In [None]:
mapped_df = map_column_with_llm(
    df,
    column="company",
    target_values=target_companies,
    model="o4-mini",
    dry_run=False,
    temperature=None,  # o-series reasoning models accept only the default temperature
)
display(mapped_df)


## Review Mapping Metrics

In [None]:
# A non-null `company_mapped` means the row was matched to a target value; null means it was not.
mapped_count = mapped_df.filter("company_mapped IS NOT NULL").count()
unmapped_count = mapped_df.filter("company_mapped IS NULL").count()

print(f"Mapped rows: {mapped_count}")
print(f"Unmapped rows: {unmapped_count}")

## Clean Up Widgets

In [None]:
if "dbutils" in globals():
    dbutils.widgets.removeAll()