# 🕵️ Target Leakage Auditing

Target leakage occurs when features contain information that only becomes
available *after* the prediction event. Models trained on such features
look accurate during training but fail in production. `audit_leakage()`
uses semantic timeline evaluation to find and remove leaked columns.

In [None]:
import polars as pl

import loclean

## Create dataset with leakage

A loan approval dataset where `approval_date` and `loan_officer_notes`
are populated only **after** the approval decision. This is classic
leakage: neither column would be available at prediction time.

In [None]:
df = pl.DataFrame(
    {
        "applicant_age": [28, 45, 35, 52, 31, 40, 55, 29, 48, 37],
        "annual_income": [
            45000,
            92000,
            67000,
            115000,
            38000,
            78000,
            98000,
            42000,
            85000,
            61000,
        ],
        "credit_score": [
            680,
            750,
            710,
            800,
            620,
            730,
            770,
            650,
            740,
            700,
        ],
        "debt_to_income": [
            0.35,
            0.22,
            0.28,
            0.15,
            0.42,
            0.25,
            0.18,
            0.38,
            0.20,
            0.30,
        ],
        "approval_date": [
            "2024-03-15",
            "2024-03-16",
            "2024-03-17",
            "2024-03-18",
            None,
            "2024-03-20",
            "2024-03-21",
            None,
            "2024-03-23",
            "2024-03-24",
        ],
        "loan_officer_notes": [
            "Approved - good DTI",
            "Approved - excellent credit",
            "Approved - stable income",
            "Approved - premium applicant",
            "Denied - high risk",
            "Approved - meets criteria",
            "Approved - senior applicant",
            "Denied - insufficient income",
            "Approved - good history",
            "Approved - standard case",
        ],
        "approved": [1, 1, 1, 1, 0, 1, 1, 0, 1, 1],
    }
)

print(f"Columns: {df.columns}")
df.head()
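
Before running the auditor, a quick manual check already hints at the problem. The sketch below uses plain Python lists copied from the DataFrame above (no loclean involved) and measures how often "`approval_date` is present" agrees with `approved`. The helper name `presence_agreement` is an illustrative assumption, not part of any library.

```python
# Values copied from the DataFrame above.
approval_date = [
    "2024-03-15", "2024-03-16", "2024-03-17", "2024-03-18", None,
    "2024-03-20", "2024-03-21", None, "2024-03-23", "2024-03-24",
]
approved = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]


def presence_agreement(feature, target):
    """Fraction of rows where 'feature is not None' equals 'target == 1'."""
    if len(feature) != len(target):
        raise ValueError("feature and target must have the same length")
    matches = sum((f is not None) == (t == 1) for f, t in zip(feature, target))
    return matches / len(target)


agreement = presence_agreement(approval_date, approved)
print(f"approval_date presence vs approved: {agreement:.0%} agreement")
```

On this data the agreement is 100%: a date exists exactly when the loan was approved, so the column encodes the target rather than predicting it.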

## Audit for leakage

The auditor evaluates whether each feature could have been known
**before** the target event. Provide a `domain` hint for better accuracy.
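
To make the "could it have been known before the target event?" question concrete, here is a hand-written version of the same rule. It is only an illustration: the `Stage` enum, the `KNOWN_AT` mapping, and `split_by_timeline` are assumptions made for this example, whereas `audit_leakage()` relies on the LLM's timeline reasoning rather than a lookup table.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Hypothetical points on the loan timeline, in chronological order."""

    APPLICATION = 1    # data supplied when the applicant applies
    DECISION = 2       # the target event: loan approved or denied
    POST_DECISION = 3  # anything recorded after the decision


# Assumed mapping for this example; written by hand, not produced by loclean.
KNOWN_AT = {
    "applicant_age": Stage.APPLICATION,
    "annual_income": Stage.APPLICATION,
    "credit_score": Stage.APPLICATION,
    "debt_to_income": Stage.APPLICATION,
    "approval_date": Stage.POST_DECISION,
    "loan_officer_notes": Stage.POST_DECISION,
}


def split_by_timeline(known_at, target_stage=Stage.DECISION):
    """Return (safe, leaked): features known before vs. at or after the target event."""
    safe = [col for col, stage in known_at.items() if stage < target_stage]
    leaked = [col for col, stage in known_at.items() if stage >= target_stage]
    return safe, leaked


safe, leaked = split_by_timeline(KNOWN_AT)
print("safe:  ", safe)
print("leaked:", leaked)
```

A table like this only works when someone already knows each column's origin; the point of the auditor is to infer that timeline from column names and the `domain` hint.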

In [None]:
pruned, summary = loclean.audit_leakage(
    df,
    target_col="approved",
    domain="Loan approval prediction",
)

print(f"Columns before: {df.columns}")
print(f"Columns after:  {pruned.columns}")
print(f"Dropped:        {summary['dropped_columns']}")

## Inspect verdicts

Each column gets a verdict with the LLM's timeline reasoning.

In [None]:
for v in summary.get("verdicts", []):
    status = "🚨 LEAK" if v["is_leakage"] else "✅ SAFE"
    print(f"{status}  {v['column']}: {v['reason']}")
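
The verdict list can also serve as a consistency check on the pruning step. The sketch below assumes only the keys already used in this notebook (`dropped_columns`, `verdicts`, `column`, `is_leakage`, `reason`). The `example_summary` dict is a hand-written stand-in with placeholder reasons, not real output from `audit_leakage()`, and `flagged_columns` is a hypothetical helper. Whether dropped columns always match leak verdicts is itself an assumption worth verifying on real output.

```python
# Hand-written stand-in for the summary dict; NOT real audit_leakage() output.
example_summary = {
    "dropped_columns": ["approval_date", "loan_officer_notes"],
    "verdicts": [
        {"column": "credit_score", "is_leakage": False,
         "reason": "(placeholder) known at application time"},
        {"column": "approval_date", "is_leakage": True,
         "reason": "(placeholder) recorded after the decision"},
        {"column": "loan_officer_notes", "is_leakage": True,
         "reason": "(placeholder) written after the decision"},
    ],
}


def flagged_columns(summary):
    """Columns whose verdict marks them as leakage (hypothetical helper)."""
    return [v["column"] for v in summary.get("verdicts", []) if v["is_leakage"]]


flagged = flagged_columns(example_summary)
dropped = example_summary["dropped_columns"]

# Columns dropped without a leak verdict, and columns flagged but not dropped.
dropped_not_flagged = sorted(set(dropped) - set(flagged))
flagged_not_dropped = sorted(set(flagged) - set(dropped))

print(f"Flagged:                 {flagged}")
print(f"Dropped without verdict: {dropped_not_flagged}")
print(f"Flagged but kept:        {flagged_not_dropped}")
```

With a real `summary`, a non-empty result in either of the last two lists would be worth investigating before training on `pruned`.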