# Loclean Quick Start

This notebook demonstrates the core features of Loclean:
- Structured extraction with Pydantic
- Data cleaning with DataFrames
- Privacy scrubbing
- Working with different backends (Pandas/Polars)

> **📚 Full Documentation:** [https://nxank4.github.io/loclean](https://nxank4.github.io/loclean)

In [1]:
import loclean
import polars as pl
import pandas as pd

## 1. Structured Extraction with Pydantic

Extract structured data from unstructured text with guaranteed schema compliance:

In [2]:
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: int
    color: str

# Extract from text
item = loclean.extract("Selling red t-shirt for 50k", schema=Product)
print(f"Name: {item.name}")
print(f"Price: {item.price}")
print(f"Color: {item.color}")

2026-01-11 19:06:59,827 - loclean.inference.local.llama_cpp - INFO - Using adapter: Phi3Adapter for model: phi-3-mini


2026-01-11 19:06:59,828 - loclean.inference.local.downloader - INFO - Model found at /home/nxank4/.cache/loclean/Phi-3-mini-4k-instruct-q4.gguf
2026-01-11 19:06:59,829 - loclean.inference.local.llama_cpp - INFO - Loading model from /home/nxank4/.cache/loclean/Phi-3-mini-4k-instruct-q4.gguf...
2026-01-11 19:07:01,888 - loclean.inference.local.llama_cpp - INFO - LlamaCppEngine initialized successfully with model: phi-3-mini


Name: red t-shirt
Price: 50
Color: red
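
To make "schema compliance" concrete, the sketch below shows what Pydantic accepts and rejects for the `Product` schema. It does not call Loclean: both payloads are hand-written stand-ins, and how Loclean enforces the schema internally is not shown here.

```python
from pydantic import BaseModel, ValidationError


class Product(BaseModel):
    name: str
    price: int
    color: str


# A payload shaped like the extraction result printed above (hand-written).
good = Product(**{"name": "red t-shirt", "price": 50, "color": "red"})
print(good.price)

# A payload that violates the schema: price cannot be parsed as an integer.
try:
    Product(**{"name": "red t-shirt", "price": "fifty", "color": "red"})
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

Any object that reaches your code as a `Product` has already passed these checks, so `item.price` can be used as an `int` without further parsing.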


## 2. Working with Tabular Data (Polars)

Process entire DataFrames with automatic batch processing:

In [3]:
# Create DataFrame with messy data
df = pl.DataFrame({
    "weight": ["5kg", "3.5 kg", "5000g", "2.2kg"]
})

print("Input Data:")
print(df)

# Clean the entire column
result = loclean.clean(
    df,
    target_col="weight",
    instruction="Convert all weights to kg"
)

# View results
print("\nCleaned Results:")
print(result.select(["weight", "clean_value", "clean_unit"]))

Input Data:
shape: (4, 1)
┌────────┐
│ weight │
│ ---    │
│ str    │
╞════════╡
│ 5kg    │
│ 3.5 kg │
│ 5000g  │
│ 2.2kg  │
└────────┘


Inference Batches: 100%|██████████| 1/1 [00:00<00:00, 2110.87batch/s]


Cleaned Results:
shape: (4, 3)
┌────────┬─────────────┬────────────┐
│ weight ┆ clean_value ┆ clean_unit │
│ ---    ┆ ---         ┆ ---        │
│ str    ┆ f64         ┆ str        │
╞════════╪═════════════╪════════════╡
│ 5kg    ┆ 5.0         ┆ kg         │
│ 3.5 kg ┆ 3.5         ┆ kg         │
│ 5000g  ┆ 5.0         ┆ kg         │
│ 2.2kg  ┆ 2.2         ┆ kg         │
└────────┴─────────────┴────────────┘
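
Because the cleaned values come from a model, it can be useful to spot-check them against a deterministic reference. The sketch below is one way to do that in plain Python for the four inputs above. The helper `to_kg` is hypothetical and not part of Loclean, and it only understands `kg` and `g`.

```python
import re

# Divisors that convert a unit to kilograms (hypothetical helper table).
_PER_KG = {"kg": 1, "g": 1000}
_WEIGHT_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(kg|g)\s*$", re.IGNORECASE)


def to_kg(raw: str) -> float:
    """Parse strings such as '5kg' or '5000g' and return kilograms."""
    match = _WEIGHT_RE.match(raw)
    if match is None:
        raise ValueError(f"Unrecognized weight: {raw!r}")
    value, unit = match.groups()
    return float(value) / _PER_KG[unit.lower()]


reference = [to_kg(w) for w in ["5kg", "3.5 kg", "5000g", "2.2kg"]]
print(reference)
```

Comparing `reference` with the `clean_value` column flags rows where the model's conversion disagrees with the rule-based one.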





## 3. Working with Pandas

Loclean also accepts Pandas DataFrames and returns a Pandas DataFrame:

In [None]:
# Create Pandas DataFrame
df_pandas = pd.DataFrame({
    "description": ["Selling red t-shirt for 50k"]
})

# Extract structured data
result = loclean.extract(df_pandas, schema=Product, target_col="description")
print(f"Result type: {type(result)}")
print(result)

Result type: <class 'pandas.core.frame.DataFrame'>
                   description  \
0  Selling red t-shirt for 50k   

                               description_extracted  
0  {'name': 'red t-shirt', 'price': 50, 'color': ...  
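
The `description_extracted` column prints as one dict per row. If separate columns are easier to work with, `pd.json_normalize` is one option. The sketch below builds a hand-written stand-in for the result instead of calling Loclean, and assumes the column holds plain Python dicts. The `color` value is truncated in the output above, so `"red"` is an assumption taken from the section 1 result.

```python
import pandas as pd

# Stand-in for the DataFrame printed above; built by hand, no model call.
result = pd.DataFrame({
    "description": ["Selling red t-shirt for 50k"],
    "description_extracted": [
        {"name": "red t-shirt", "price": 50, "color": "red"}  # color assumed
    ],
})

# Expand each dict into its own columns and keep the original text alongside.
expanded = pd.json_normalize(result["description_extracted"].tolist())
flat = pd.concat([result[["description"]], expanded], axis=1)
print(flat.columns.tolist())
```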



## 4. Privacy Scrubbing

Scrub personally identifiable information (PII) locally:

In [None]:
# Text with PII
text = "Contact John Doe at john@example.com or 555-1234"

# Scrub PII (default: mask mode)
cleaned = loclean.scrub(text, mode="mask")
print(f"Original: {text}")
print(f"Cleaned:  {cleaned}")

parse: error parsing grammar: expecting newline or end at _type ws "," ws "value" ws ":" ws string ws "}"
pii_type  ::= "\"person\"" | "\"address\""
string    ::= "\"" ([^"\\] | "\\" (["\\/bfnrt] | "u" [0-9a-fA-F]{4}))* "\""
ws        ::= [ \t\n\r]*



root      ::= object
object    ::= "{" ws "entities" ws ":" ws array ws ("," ws "reasoning" ws ":" ws string)? ws "}"
array     ::= "[" ws (entity ("," ws entity)*)? ws "]"
entity    ::= "{" ws "type" ws ":" ws pii_type ws "," ws "value" ws ":" ws string ws "}"
pii_type  ::= "\"person\"" | "\"address\""
string    ::= "\"" ([^"\\] | "\\" (["\\/bfnrt] | "u" [0-9a-fA-F]{4}))* "\""
ws        ::= [ \t\n\r]*


llama_grammar_init_impl: failed to parse grammar
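
The captured output above is a grammar parse error rather than a masked string, so the effect of masking is not visible in this run. As a rough illustration of the idea only, the sketch below masks the email address and phone number in the same text with regular expressions. The patterns, the `[EMAIL]` and `[PHONE]` placeholders, and the helper name `mask_basic_pii` are all hypothetical; Loclean's detection logic and mask format may differ.

```python
import re

# Hypothetical patterns for illustration; real PII detection needs far more.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b\d{3}-\d{4}\b")


def mask_basic_pii(text: str) -> str:
    """Replace email addresses and 7-digit phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


text = "Contact John Doe at john@example.com or 555-1234"
masked = mask_basic_pii(text)
print(masked)
```

The name "John Doe" is left untouched here because regular expressions cannot reliably detect personal names.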


## 5. Extraction with DataFrames (Polars)

Extract structured data from DataFrame columns:

In [None]:
df = pl.DataFrame({
    "description": [
        "Selling red t-shirt for 50k",
        "Blue jeans available for 30k"
    ]
})

result = loclean.extract(df, schema=Product, target_col="description")

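# Query extracted data using Polars Struct.
# Compare against the value as extracted: in the runs above, "50k" came back as 50,
# so a threshold of 40000 would match no rows.
filtered = result.filter(
    pl.col("description_extracted").struct.field("price") > 40
)
print("Products with price > 40k:")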
print(filtered)

## Next Steps

- **Data Cleaning:** See [02-data-cleaning.ipynb](./02-data-cleaning.ipynb) for detailed cleaning examples
- **Privacy Scrubbing:** See [03-privacy-scrubbing.ipynb](./03-privacy-scrubbing.ipynb) for PII removal
- **Structured Extraction:** See [04-structured-extraction.ipynb](./04-structured-extraction.ipynb) for complex schemas
- **Full Documentation:** [https://nxank4.github.io/loclean](https://nxank4.github.io/loclean)