# 🏠 Kaggle Housing Price Prediction: Full loclean Workflow

A realistic **data science** notebook showing how `loclean` accelerates the entire
data-preparation pipeline for a Kaggle-style regression task.

**Pipeline:** Raw messy data → Clean → Entity Resolution → Feature Discovery → Quality Validation → Model-ready DataFrame

In [None]:
import polars as pl

import loclean

MODEL = "qwen2.5-coder:1.5b"

## 1 · Raw Data: messy, real-world-like

This simulates what you might download from a Kaggle competition: inconsistent
formatting, inconsistently written units, duplicate entity names, and class imbalance.

In [None]:
raw = pl.DataFrame(
    {
        "address": [
            "123 Main St, Springfield, IL",
            "456 Oak Ave, springfield, illinois",
            "789 Pine Rd, Chicago, IL",
            "321 Elm St, chicago, Illinois",
            "654 Maple Dr, Naperville, IL",
            "987 Cedar Ln, Naperville, IL",
            "111 Birch Way, Joliet, IL",
            "222 Walnut St, JOLIET, Illinois",
            "333 Ash Ct, Peoria, IL",
            "444 Spruce Pl, Rockford, IL",
        ],
        "city": [
            "Springfield",
            "springfield",
            "Chicago",
            "chicago",
            "Naperville",
            "Naperville",
            "Joliet",
            "JOLIET",
            "Peoria",
            "Rockford",
        ],
        "size_raw": [
            "1,200 sqft",
            "1800 sq ft",
            "2400 square feet",
            "950sqft",
            "3,100 sqft",
            "1600 sq. ft.",
            "2800sqft",
            "1100 sqft",
            "2,000 sq ft",
            "1450 sq ft",
        ],
        "bedrooms": [2, 3, 4, 1, 5, 3, 4, 2, 3, 2],
        "bathrooms": [1, 2, 3, 1, 3, 2, 3, 1, 2, 2],
        "year_built": [1990, 2005, 2018, 1975, 2022, 2000, 2015, 1985, 2010, 1995],
        "lot_acres": [0.15, 0.25, 0.40, 0.10, 0.60, 0.20, 0.35, 0.12, 0.30, 0.18],
        "is_luxury": ["no", "no", "yes", "no", "yes", "no", "yes", "no", "no", "no"],
        "price": [
            250_000,
            380_000,
            520_000,
            180_000,
            720_000,
            310_000,
            480_000,
            220_000,
            400_000,
            280_000,
        ],
    }
)

print(f"Shape: {raw.shape}")
print(f"Columns: {raw.columns}")
raw

## 2 · Data Cleaning: extract numeric values from messy strings

The `size_raw` column has inconsistent formats: `"1,200 sqft"`, `"2400 square feet"`,
`"950sqft"`. `loclean.clean()` uses the LLM to extract the numeric value.

In [None]:
cleaned = loclean.clean(
    raw,
    "size_raw",
    instruction="Extract the numeric square footage value only, as an integer.",
    model=MODEL,
)

print("Before → After:")
cleaned.select("size_raw", "clean_value", "clean_unit")
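
Because LLM extraction can occasionally misread a value, it may help to cross-check `clean_value` against a deterministic baseline. The sketch below uses only the standard library and repeats the `size_raw` strings from the raw data. `parse_sqft` is a hypothetical helper written for illustration; it is not part of `loclean`.

```python
import re
from typing import Optional

# The same ten strings as the `size_raw` column in the raw data.
SIZE_RAW = [
    "1,200 sqft", "1800 sq ft", "2400 square feet", "950sqft", "3,100 sqft",
    "1600 sq. ft.", "2800sqft", "1100 sqft", "2,000 sq ft", "1450 sq ft",
]


def parse_sqft(text: str) -> Optional[int]:
    """Return the first number in `text` as an int, ignoring thousands separators."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None  # no digits found; leave the value missing
    return int(match.group().replace(",", ""))


baseline = [parse_sqft(s) for s in SIZE_RAW]
print(baseline)
```

Rows where this baseline and the LLM output disagree are good candidates for manual inspection.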

## 3 · Entity Resolution: canonicalize city names

`"Springfield"` vs `"springfield"`, `"Joliet"` vs `"JOLIET"`: the LLM groups
these variants into canonical forms automatically.

In [None]:
resolved = loclean.resolve_entities(
    cleaned,
    "city",
    threshold=0.8,
    model=MODEL,
)

print("Entity resolution results:")
resolved.select("city", "city_canonical").unique()
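
For this dataset the variants differ only in letter case, so a simple case-folding baseline gives a reference against which to compare `city_canonical`. The sketch below is a minimal stdlib version; `build_canonical_map` is a hypothetical helper, not a `loclean` function, and it would not catch typos or abbreviations the way a fuzzy or LLM-based resolver might.

```python
from collections import Counter, defaultdict

# The same values as the `city` column in the raw data.
CITIES = [
    "Springfield", "springfield", "Chicago", "chicago", "Naperville",
    "Naperville", "Joliet", "JOLIET", "Peoria", "Rockford",
]


def build_canonical_map(values):
    """Map each spelling to one canonical spelling per case-insensitive group."""
    groups = defaultdict(Counter)
    for value in values:
        groups[value.strip().casefold()][value] += 1

    mapping = {}
    for variants in groups.values():
        # Prefer the most frequent spelling; break ties in favour of Title Case.
        canonical = max(variants, key=lambda v: (variants[v], v.istitle()))
        for variant in variants:
            mapping[variant] = canonical
    return mapping


city_map = build_canonical_map(CITIES)
for raw_name, canonical in sorted(city_map.items()):
    print(f"{raw_name!r} -> {canonical!r}")
```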

## 4 · Feature Discovery: LLM-proposed feature engineering

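The LLM analyses column types and sample values, then proposes mathematical
transformations (e.g. `rooms_per_acre`, `house_age`) that aim to maximise mutual
information with the target. Any proposed feature computed from the target itself
(e.g. `log_price`) would leak the label and should be dropped before modelling.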

In [None]:
# Use only numeric columns for feature discovery
numeric_df = resolved.select(
    "bedrooms", "bathrooms", "year_built", "lot_acres", "price"
)

enriched = loclean.discover_features(
    numeric_df,
    "price",
    n_features=3,
    max_retries=5,
    model=MODEL,
)

new_cols = [c for c in enriched.columns if c not in numeric_df.columns]
print(f"Discovered {len(new_cols)} features: {new_cols}")
enriched
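
To make the idea concrete, the sketch below builds two hand-written candidate features from the same numeric columns and ranks them by absolute Pearson correlation with `price`. This is only an illustration under stated assumptions: the feature names and `REFERENCE_YEAR` are hypothetical, the features `discover_features` actually returns may differ, and Pearson correlation is just a rough linear stand-in for mutual information.

```python
import math

# Numeric columns copied from the raw data.
bedrooms = [2, 3, 4, 1, 5, 3, 4, 2, 3, 2]
bathrooms = [1, 2, 3, 1, 3, 2, 3, 1, 2, 2]
year_built = [1990, 2005, 2018, 1975, 2022, 2000, 2015, 1985, 2010, 1995]
lot_acres = [0.15, 0.25, 0.40, 0.10, 0.60, 0.20, 0.35, 0.12, 0.30, 0.18]
price = [250_000, 380_000, 520_000, 180_000, 720_000,
         310_000, 480_000, 220_000, 400_000, 280_000]

REFERENCE_YEAR = 2025  # assumption: the year used to compute house age


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical candidates; none of them uses the target column.
candidates = {
    "house_age": [REFERENCE_YEAR - y for y in year_built],
    "rooms_per_acre": [
        (bed + bath) / acres
        for bed, bath, acres in zip(bedrooms, bathrooms, lot_acres)
    ],
}

scores = {name: pearson(values, price) for name, values in candidates.items()}
for name in sorted(scores, key=lambda k: abs(scores[k]), reverse=True):
    print(f"{name}: r = {scores[name]:+.3f}")
```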

## 5 · Oversampling: handle class imbalance

Only 3 of 10 houses are `"luxury"`. The LLM generates realistic synthetic
luxury records to balance the dataset.

In [None]:
from pydantic import BaseModel


class HouseRecord(BaseModel):
    bedrooms: int
    bathrooms: int
    year_built: int
    lot_acres: float
    price: int
    is_luxury: str


luxury_before = raw.filter(pl.col("is_luxury") == "yes").shape[0]
print(f"Luxury houses before: {luxury_before} / {raw.shape[0]}")

oversampled = loclean.oversample(
    raw.select(
        "bedrooms", "bathrooms", "year_built", "lot_acres", "price", "is_luxury"
    ),
    target_col="is_luxury",
    target_value="yes",
    n=4,
    schema=HouseRecord,
    model=MODEL,
)

luxury_after = oversampled.filter(pl.col("is_luxury") == "yes").shape[0]
print(f"Luxury houses after: {luxury_after} / {oversampled.shape[0]}")
oversampled.tail(5)
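
The choice of `n=4` can be checked with simple arithmetic. The sketch below assumes that `n` is the number of synthetic minority rows appended to the data (an assumption about `oversample`, not something confirmed here); under that reading, 3 + 4 = 7 luxury rows match the 7 non-luxury rows.

```python
from collections import Counter

# The same labels as the `is_luxury` column in the raw data.
is_luxury = ["no", "no", "yes", "no", "yes", "no", "yes", "no", "no", "no"]
n_synthetic = 4  # assumption: number of synthetic "yes" rows that get appended

before = Counter(is_luxury)
after = before + Counter({"yes": n_synthetic})

for label, counts in (("before", before), ("after", after)):
    total = sum(counts.values())
    print(f"{label}: {counts['yes']} luxury / {total} rows "
          f"({counts['yes'] / total:.0%})")
```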

## 6 · Quality Validation: LLM-powered data auditing

Define constraints in plain English. The LLM evaluates each row and reports
compliance with reasoning for failures.

In [None]:
report = loclean.validate_quality(
    raw,
    rules=[
        "Price must be a positive number greater than 50,000",
        "Bedrooms must be between 1 and 10",
        "Year built must be between 1800 and 2025",
    ],
    sample_size=10,
    model=MODEL,
)

print(f"Compliance rate: {report['compliance_rate']:.1%}")
print(f"Rows checked: {report['rows_checked']}")

if report.get("failures"):
    print(f"\nFailures ({len(report['failures'])}):\n")
    for f in report["failures"][:3]:
        print(f"  Row {f['row_index']}: {f['rule']}")
        print(f"    Reason: {f['reason']}\n")
else:
    print("\n✅ All rows pass quality validation!")
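
The three rules above are simple enough to express as ordinary predicates, which gives a deterministic reference for the LLM report. The sketch below assumes the compliance rate is the fraction of rows that pass every rule; how `validate_quality` computes it may differ. The `RULES` dictionary is a hypothetical structure for illustration.

```python
# Columns copied from the raw data.
price = [250_000, 380_000, 520_000, 180_000, 720_000,
         310_000, 480_000, 220_000, 400_000, 280_000]
bedrooms = [2, 3, 4, 1, 5, 3, 4, 2, 3, 2]
year_built = [1990, 2005, 2018, 1975, 2022, 2000, 2015, 1985, 2010, 1995]

rows = [
    {"price": p, "bedrooms": b, "year_built": y}
    for p, b, y in zip(price, bedrooms, year_built)
]

# Hypothetical hand-written equivalents of the plain-English rules.
RULES = {
    "price > 50,000": lambda r: r["price"] > 50_000,
    "1 <= bedrooms <= 10": lambda r: 1 <= r["bedrooms"] <= 10,
    "1800 <= year_built <= 2025": lambda r: 1800 <= r["year_built"] <= 2025,
}

failures = [
    (index, name)
    for index, row in enumerate(rows)
    for name, check in RULES.items()
    if not check(row)
]
failed_rows = {index for index, _ in failures}
compliance_rate = 1 - len(failed_rows) / len(rows)

print(f"Compliance rate: {compliance_rate:.1%}")
print(f"Failures: {failures}")
```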

## 7 · Privacy Scrubbing: redact PII before sharing

Before sharing the dataset (e.g. uploading to Kaggle), scrub any PII.

In [None]:
original_addresses = raw.select("address").to_series().to_list()

scrubbed = loclean.scrub(
    raw,
    target_col="address",
    mode="mask",
    model=MODEL,
)

scrubbed_addresses = scrubbed.select("address").to_series().to_list()

print("Original → Scrubbed:")
for orig, masked in zip(original_addresses[:5], scrubbed_addresses[:5], strict=True):
    print(f"  {orig}")
    print(f"  → {masked}\n")
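
For comparison, a rule-based mask for this particular address layout can be sketched with a regular expression. This is a simplified illustration: it assumes every address starts with a house number followed by a street name and a comma, and the `[ADDRESS]` token and `mask_street` helper are hypothetical. The output format of `loclean.scrub` may look different.

```python
import re

# A few values from the `address` column in the raw data.
ADDRESSES = [
    "123 Main St, Springfield, IL",
    "456 Oak Ave, springfield, illinois",
    "789 Pine Rd, Chicago, IL",
]

# House number plus everything up to (not including) the first comma.
STREET_PATTERN = re.compile(r"^\s*\d+\s+[^,]+")


def mask_street(address: str, token: str = "[ADDRESS]") -> str:
    """Replace the leading street portion of an address with a mask token."""
    return STREET_PATTERN.sub(token, address, count=1)


masked = [mask_street(a) for a in ADDRESSES]
for original, redacted in zip(ADDRESSES, masked):
    print(f"{original}  ->  {redacted}")
```

A fixed pattern like this misses free-form PII, which is the case an LLM-based scrubber is meant to cover.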

## Summary

| Step | API | What it does |
|------|-----|-------------|
| Clean | `loclean.clean()` | Extract numeric values from messy strings |
| Entity Resolution | `loclean.resolve_entities()` | Canonicalize city names |
| Feature Discovery | `loclean.discover_features()` | LLM-proposed feature engineering |
| Oversampling | `loclean.oversample()` | Generate synthetic minority records |
| Quality Validation | `loclean.validate_quality()` | Data quality audit in plain English |
| Privacy Scrubbing | `loclean.scrub()` | Redact PII before sharing |