# 🔗 Entity Resolution

Canonicalize messy string variations (typos, abbreviations, casing) under a single label using `loclean.resolve_entities`.

**Use case:** You have a column of company names entered by different people: some wrote "Google", others "google", "GOOGLE Inc.", or "Alphabet / Google". Entity resolution merges them into one canonical form.

In [None]:
import polars as pl

import loclean

## Create messy data

16 company name variations across 5 real companies:

In [None]:
df = pl.DataFrame(
    {
        "company": [
            "Google LLC",
            "google",
            "GOOGLE Inc.",
            "Alphabet / Google",
            "Microsoft Corp",
            "microsoft",
            "MSFT",
            "Apple Inc.",
            "apple",
            "AAPL",
            "Amazon.com Inc",
            "amazon",
            "AMZN",
            "Meta Platforms",
            "meta",
            "Facebook (Meta)",
        ]
    }
)

print(f"Unique values before: {df['company'].n_unique()}")
df

## Resolve entities

The `threshold` parameter controls how aggressively values are merged, on a scale from 0 (merge nothing) to 1 (merge everything).

In [None]:
result = loclean.resolve_entities(df, "company", threshold=0.8)
result
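
To build intuition for the `threshold` scale, here is a minimal standalone sketch of threshold-based merging. It is not how `loclean` works internally (the notebook does not say which matching method the library uses). The helpers `toy_distance` and `toy_resolve` are hypothetical names, and the character-level distance from the standard library's `difflib` is an assumption chosen only because it needs no extra dependencies.

```python
from difflib import SequenceMatcher


def toy_distance(a: str, b: str) -> float:
    """Character-level distance in [0, 1]; 0 means the strings are identical."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def toy_resolve(values: list[str], threshold: float) -> dict[str, str]:
    """Map each value to the first earlier value within `threshold` distance.

    Hypothetical helper for illustration only, not part of loclean.
    """
    canonicals: list[str] = []
    mapping: dict[str, str] = {}
    for value in values:
        # Reuse an existing canonical label if one is close enough.
        match = next(
            (c for c in canonicals if toy_distance(value, c) <= threshold),
            None,
        )
        if match is None:
            # Nothing close enough: this value starts its own group.
            canonicals.append(value)
            match = value
        mapping[value] = match
    return mapping


names = ["Google", "google", "Microsoft"]
for t in (0.0, 0.2, 1.0):
    print(t, toy_resolve(names, t))
```

In this toy version a threshold of 0 merges only exact duplicates, while a threshold of 1 collapses every value into the first one seen, which mirrors the "0 = nothing, 1 = everything" description above. Treat it purely as intuition for the parameter, not as a prediction of what `loclean.resolve_entities` returns.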

## Compare before vs. after

In [None]:
unique_before = df["company"].n_unique()
unique_after = result["company_canonical"].n_unique()

print(
    f"Unique values: {unique_before} → {unique_after} ({unique_before - unique_after} merged)"
)
result.select(["company", "company_canonical"])
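
The unique-value count says how many labels remain, but not which spellings ended up under each label. One way to inspect that is to group the original values by their canonical label. The sketch below uses a hypothetical helper, `group_by_canonical`, and hand-written placeholder pairs (not real `loclean` output) so that it runs on its own.

```python
from collections import defaultdict
from collections.abc import Iterable


def group_by_canonical(pairs: Iterable[tuple[str, str]]) -> dict[str, list[str]]:
    """Collect the original spellings that share each canonical label."""
    groups: dict[str, list[str]] = defaultdict(list)
    for original, canonical in pairs:
        groups[canonical].append(original)
    return dict(groups)


# Hand-written placeholder pairs for illustration, NOT real loclean output.
example_pairs = [
    ("google", "Google LLC"),
    ("GOOGLE Inc.", "Google LLC"),
    ("MSFT", "Microsoft Corp"),
]

for canonical, originals in group_by_canonical(example_pairs).items():
    print(f"{canonical}: {originals}")
```

With the notebook's own data, the pairs could come from `result.select(["company", "company_canonical"]).iter_rows()`, assuming `result` is a Polars DataFrame containing both columns as in the cell above.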