# LLM-Powered Column Mapping Demo

This notebook shows how to normalize free-form values in a Spark DataFrame with `map_column_with_llm`. The helper can run in dry-run mode (no external calls) or call OpenAI/Azure OpenAI with retries, caching, and usage tracking.

In [None]:
# Databricks notebooks normally provide a SparkSession named `spark`.
# This fallback ensures local execution still works.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# A small shuffle partition count is enough for this tiny demo DataFrame.
spark.conf.set("spark.sql.shuffle.partitions", "8")

## Configure LLM Credentials

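On Databricks, the cell below creates text widgets for your credentials and copies their values into the environment variables that the mapping helper reads. Outside Databricks, it falls back to whatever is already set in the environment.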

In [None]:
import os

if "dbutils" in globals():
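    # Do not call removeAll() here: Databricks cannot reliably create widgets in
    # the same cell that removes them, and doing so would discard entered values.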
    dbutils.widgets.text("OPENAI_API_KEY", "", "OpenAI API Key")
    dbutils.widgets.text("AZURE_OPENAI_ENDPOINT", "", "Azure OpenAI Endpoint (optional)")
    dbutils.widgets.text("AZURE_OPENAI_KEY", "", "Azure OpenAI Key (optional)")
    dbutils.widgets.text("AZURE_OPENAI_API_VERSION", "2023-05-15", "Azure OpenAI API Version")

    openai_key = dbutils.widgets.get("OPENAI_API_KEY")
    azure_endpoint = dbutils.widgets.get("AZURE_OPENAI_ENDPOINT")
    azure_key = dbutils.widgets.get("AZURE_OPENAI_KEY")
    azure_version = dbutils.widgets.get("AZURE_OPENAI_API_VERSION")
else:
    openai_key = os.environ.get("OPENAI_API_KEY", "")
    azure_endpoint = os.environ.get("AZURE_OPENAI_ENDPOINT", "")
    azure_key = os.environ.get("AZURE_OPENAI_KEY", "")
    azure_version = os.environ.get("AZURE_OPENAI_API_VERSION", "2023-05-15")

if openai_key:
    os.environ["OPENAI_API_KEY"] = openai_key
if azure_endpoint:
    os.environ["AZURE_OPENAI_ENDPOINT"] = azure_endpoint
if azure_key:
    os.environ["AZURE_OPENAI_KEY"] = azure_key
if azure_version:
    os.environ["AZURE_OPENAI_API_VERSION"] = azure_version
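
Before moving on, it can help to confirm which settings were actually picked up without printing any secret in full. The following is a minimal sketch; `mask` and `summarize_credentials` are hypothetical names introduced here, and how the mapping helper chooses between OpenAI and Azure OpenAI when both are set is not covered by this notebook.

```python
import os


def mask(value, visible=4):
    """Return a form of a secret that is safe to print."""
    if not value:
        return "<not set>"
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * 8 + value[-visible:]


def summarize_credentials(env):
    """Report which provider settings are present, without exposing secrets."""
    has_openai = bool(env.get("OPENAI_API_KEY"))
    # Azure is treated as configured only when both endpoint and key are set
    # (an assumption made for this sketch).
    has_azure = bool(env.get("AZURE_OPENAI_ENDPOINT")) and bool(env.get("AZURE_OPENAI_KEY"))
    return {
        "openai_configured": has_openai,
        "azure_configured": has_azure,
        "openai_key": mask(env.get("OPENAI_API_KEY", "")),
        "azure_key": mask(env.get("AZURE_OPENAI_KEY", "")),
    }


print(summarize_credentials(os.environ))
```

If neither provider shows as configured, only the dry-run step later in the notebook is expected to work.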

## Build a Sample DataFrame

In [None]:
from pyspark.sql import Row

sample_data = [
    Row(id=1, company="OpenAI Inc."),
    Row(id=2, company="Alphabet"),
    Row(id=3, company="Micro Soft"),
    Row(id=4, company="amazon.com"),
    Row(id=5, company=None),
]

df = spark.createDataFrame(sample_data)
display(df)

## Run Dry-Run Mapping

Dry-run mode performs exact, case-insensitive matching against the target list without calling an LLM. Use it to estimate how many rows already match before spending API calls on the rest.
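
To make "exact, case-insensitive" concrete, here is a plain-Python sketch of that rule applied to the sample values. It illustrates the matching rule only and is not the helper's implementation; `exact_match` is a hypothetical name, and whether the helper also trims whitespace is not stated here, so this sketch does not.

```python
def exact_match(value, targets):
    """Return the canonical target equal to `value` ignoring case, else None."""
    if value is None:
        return None
    # Index targets by their case-folded form so lookups ignore case.
    lookup = {t.casefold(): t for t in targets}
    return lookup.get(value.casefold())


targets = ["OpenAI", "Alphabet", "Microsoft", "Amazon"]
samples = ["OpenAI Inc.", "Alphabet", "Micro Soft", "amazon.com", None]

for s in samples:
    print(repr(s), "->", exact_match(s, targets))
```

Under this reading of the rule, only `"Alphabet"` matches; the other non-null values differ by more than case and are the ones the live step would need to resolve.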

In [None]:
from spark_fuse.utils.transformations import map_column_with_llm

target_companies = ["OpenAI", "Alphabet", "Microsoft", "Amazon"]

dry_run_df = map_column_with_llm(
    df,
    column="company",
    target_values=target_companies,
    dry_run=True,
)
display(dry_run_df)

## Run Live Mapping

With credentials configured, set `dry_run=False` to call the LLM for fuzzy matching.
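
The helper's prompt and response handling are not shown in this notebook. As a rough illustration of what LLM-backed mapping has to guard against, the sketch below builds a prompt and accepts the model's answer only if it names one of the allowed targets. Both `build_prompt` and `validate_answer` are hypothetical, and the prompt wording is an assumption.

```python
def build_prompt(value, targets):
    """Build a hypothetical prompt asking for one allowed value or NONE."""
    options = "\n".join(f"- {t}" for t in targets)
    return (
        "Map the input to exactly one of the allowed values, "
        "or answer NONE if nothing fits.\n"
        f"Allowed values:\n{options}\n"
        f"Input: {value}\n"
        "Answer:"
    )


def validate_answer(answer, targets):
    """Accept the model's answer only if it names an allowed target."""
    if answer is None:
        return None
    # Drop surrounding whitespace and quotes, then compare ignoring case.
    cleaned = answer.strip().strip("\"'").casefold()
    lookup = {t.casefold(): t for t in targets}
    return lookup.get(cleaned)  # None for "NONE" or any unexpected text


targets = ["OpenAI", "Alphabet", "Microsoft", "Amazon"]
print(build_prompt("Micro Soft", targets))
print(validate_answer(" Microsoft\n", targets))
```

Rejecting anything outside the target list keeps a free-text reply such as "Microsoft Corporation" from leaking into the mapped column.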

In [None]:
mapped_df = map_column_with_llm(
    df,
    column="company",
    target_values=target_companies,
    model="gpt-3.5-turbo",
    dry_run=False,
)
display(mapped_df)
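
The introduction mentions retries and caching. How the helper implements them is not shown here, but the general idea can be sketched in plain Python: resolve each distinct value once, and retry transient failures with exponential backoff. The names `call_with_retries` and `map_distinct`, and the default attempt count and delay, are assumptions made for this sketch.

```python
import time


def call_with_retries(fn, value, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(value), retrying on exceptions with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(value)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


def map_distinct(values, fn, **retry_kwargs):
    """Resolve each distinct non-null value once and reuse the cached result."""
    cache = {}
    for v in values:
        if v is not None and v not in cache:
            cache[v] = call_with_retries(fn, v, **retry_kwargs)
    # Null inputs are never sent to fn and stay None.
    return [cache.get(v) for v in values]
```

With a column that repeats the same company many times, a cache like this means the number of LLM calls scales with the number of distinct values rather than the number of rows.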

## Review Mapping Metrics

In [None]:
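from pyspark.sql import functions as F

# One pass: F.count on a column skips nulls, F.count("*") counts every row.
stats = mapped_df.agg(F.count("company_mapped").alias("mapped"), F.count("*").alias("total")).first()
mapped_count = stats["mapped"]
unmapped_count = stats["total"] - stats["mapped"]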

print(f"Mapped rows: {mapped_count}")
print(f"Unmapped rows: {unmapped_count}")
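
The two counts above do not distinguish a row the LLM could not map from a row whose source value was null to begin with (such as `id=5` in the sample). A possible refinement, sketched in plain Python on collected `(company, company_mapped)` pairs, reports those cases separately. `mapping_summary` is a hypothetical helper, and the pairs below are made-up values for illustration, not real model output.

```python
def mapping_summary(pairs):
    """Summarize (original, mapped) pairs, keeping null inputs separate."""
    null_input = sum(1 for original, _ in pairs if original is None)
    mapped = sum(1 for original, m in pairs if original is not None and m is not None)
    unmapped = sum(1 for original, m in pairs if original is not None and m is None)
    attempted = mapped + unmapped
    return {
        "total": len(pairs),
        "null_input": null_input,
        "mapped": mapped,
        "unmapped": unmapped,
        # Coverage is measured over non-null inputs only.
        "coverage": mapped / attempted if attempted else 0.0,
    }


# Hypothetical results for illustration only. In the notebook, the pairs could
# come from mapped_df.select("company", "company_mapped").collect().
pairs = [
    ("OpenAI Inc.", "OpenAI"),
    ("Alphabet", "Alphabet"),
    ("Micro Soft", "Microsoft"),
    ("amazon.com", None),
    (None, None),
]
print(mapping_summary(pairs))
```

Collecting rows to the driver is only reasonable for small demo data like this; on a large DataFrame the same breakdown would be better expressed as a Spark aggregation.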

## Clean Up Widgets

In [None]:
if "dbutils" in globals():
    dbutils.widgets.removeAll()