# Loclean Quick Start

This notebook demonstrates the core features of Loclean:
- Structured extraction with Pydantic
- Data cleaning with DataFrames
- Privacy scrubbing
- Working with different backends (Pandas/Polars)

> **📚 Full Documentation:** [https://nxank4.github.io/loclean](https://nxank4.github.io/loclean)

In [None]:
import loclean
import polars as pl
import pandas as pd

## 1. Structured Extraction with Pydantic

Extract structured data from unstructured text with guaranteed schema compliance:

In [None]:
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: int
    color: str

# Extract from text
item = loclean.extract("Selling red t-shirt for 50k", schema=Product)
print(f"Name: {item.name}")
print(f"Price: {item.price}")
print(f"Color: {item.color}")
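
The schema is what defines "compliance" in this step. As a minimal sketch that uses Pydantic on its own (no Loclean call, no model), the snippet below shows what the `Product` schema accepts and what it rejects. How Loclean produces conforming values internally is not covered here.

```python
from pydantic import BaseModel, ValidationError


class Product(BaseModel):
    name: str
    price: int
    color: str


# A well-formed record: the numeric string is coerced to an int.
ok = Product(name="t-shirt", price="50000", color="red")
print(ok.price)

# A malformed record: "50k" is not a valid integer, so validation fails.
try:
    Product(name="t-shirt", price="50k", color="red")
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

Because `price` is declared as `int`, a raw string such as "50k" can never appear in that field of a validated `Product`.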

## 2. Working with Tabular Data (Polars)

Clean an entire DataFrame column in one call; rows are batched automatically:

In [None]:
# Create DataFrame with messy data
df = pl.DataFrame({
    "weight": ["5kg", "3.5 kg", "5000g", "2.2kg"]
})

print("Input Data:")
print(df)

# Clean the entire column
result = loclean.clean(
    df,
    target_col="weight",
    instruction="Convert all weights to kg"
)

# View results
print("\nCleaned Results:")
print(result.select(["weight", "weight_clean_value", "weight_clean_unit"]))
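
To sanity-check the cleaned column, it can help to compute the expected values by hand. The following is a rule-based sketch for the four sample strings only; `to_kg` and `UNIT_DIVISOR` are hypothetical helpers written for this check, not part of Loclean and not a description of how it works.

```python
import re

# Divisors that convert each supported unit to kilograms.
UNIT_DIVISOR = {"kg": 1, "g": 1000}


def to_kg(raw: str) -> float:
    """Parse strings such as '3.5 kg' or '5000g' and return kilograms."""
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*(kg|g)\s*", raw.lower())
    if match is None:
        raise ValueError(f"unrecognised weight: {raw!r}")
    value, unit = match.groups()
    return float(value) / UNIT_DIVISOR[unit]


weights = ["5kg", "3.5 kg", "5000g", "2.2kg"]
expected = [to_kg(w) for w in weights]
print(expected)  # [5.0, 3.5, 5.0, 2.2]
```

A fixed rule like this only covers units listed in advance, which is the limitation that instruction-based cleaning is meant to address.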

## 3. Working with Pandas

The same `extract` call accepts a Pandas DataFrame:

In [None]:
# Create Pandas DataFrame
df_pandas = pd.DataFrame({
    "description": ["Selling red t-shirt for 50k"]
})

# Extract structured data
result = loclean.extract(df_pandas, schema=Product, target_col="description")
print(f"Result type: {type(result)}")
print(result)

## 4. Privacy Scrubbing

Scrub personally identifiable information (PII) locally:

In [None]:
# Text with PII
text = "Contact John Doe at john@example.com or 555-1234"

# Scrub PII (default: mask mode)
cleaned = loclean.scrub(text, mode="mask")
print(f"Original: {text}")
print(f"Cleaned:  {cleaned}")
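
For intuition about what masking means, here is a regex-only sketch that replaces an email address and a phone number with placeholders. The `mask` function, the patterns, and the `[EMAIL]`/`[PHONE]` placeholder format are assumptions made for illustration; they are not Loclean's implementation or its output format.

```python
import re

# Deliberately simple patterns, sufficient for the sample sentence only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{3}-\d{4}\b")


def mask(text: str) -> str:
    """Replace email addresses and 7-digit phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


masked = mask("Contact John Doe at john@example.com or 555-1234")
print(masked)  # Contact John Doe at [EMAIL] or [PHONE]
```

Note that the name "John Doe" passes through untouched: personal names have no fixed pattern, so regular expressions alone cannot catch them.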

## 5. Extraction with DataFrames (Polars)

Extract structured data from DataFrame columns:

In [None]:
df = pl.DataFrame({
    "description": [
        "Selling red t-shirt for 50k",
        "Blue jeans available for 30k"
    ]
})

result = loclean.extract(df, schema=Product, target_col="description")

# Query extracted data using Polars Struct
filtered = result.filter(
    pl.col("description_extracted").struct.field("price") > 40000
)
print("Products with price > 40k:")
print(filtered)

## Next Steps

- **Data Cleaning:** See [02-data-cleaning.ipynb](./02-data-cleaning.ipynb) for detailed cleaning examples
- **Privacy Scrubbing:** See [03-privacy-scrubbing.ipynb](./03-privacy-scrubbing.ipynb) for PII removal
- **Structured Extraction:** See [04-structured-extraction.ipynb](./04-structured-extraction.ipynb) for complex schemas
- **Full Documentation:** [https://nxank4.github.io/loclean](https://nxank4.github.io/loclean)