# Loclean Quick Start

This notebook demonstrates the core features of Loclean:
- Structured extraction with Pydantic
- Data cleaning with DataFrames
- Privacy scrubbing
- Working with different backends (Pandas/Polars)

> **📚 Full Documentation:** [https://nxank4.github.io/loclean](https://nxank4.github.io/loclean)

In [1]:
import loclean
import polars as pl
import pandas as pd

## 1. Structured Extraction with Pydantic

Extract structured data from unstructured text with guaranteed schema compliance:

In [2]:
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: int
    color: str

# Extract from text
item = loclean.extract("Selling red t-shirt for 50k", schema=Product)
print(f"Name: {item.name}")
print(f"Price: {item.price}")
print(f"Color: {item.color}")

2026-01-13 16:35:42,928 - loclean.inference.local.llama_cpp - INFO - Using adapter: Phi3Adapter for model: phi-3-mini
2026-01-13 16:35:42,929 - loclean.inference.local.downloader - INFO - Model found at /home/nxank4/.cache/loclean/Phi-3-mini-4k-instruct-q4.gguf
2026-01-13 16:35:42,930 - loclean.inference.local.llama_cpp - INFO - Loading model from /home/nxank4/.cache/loclean/Phi-3-mini-4k-instruct-q4.gguf...


2026-01-13 16:35:49,381 - loclean.inference.local.llama_cpp - INFO - LlamaCppEngine initialized successfully with model: phi-3-mini


Name: red t-shirt
Price: 50
Color: red
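
In the output above, `price` comes back as 50 although the text says "50k". If the downstream analysis needs the full amount, one option is a small post-processing step on the original text. The sketch below is plain Python and not part of Loclean; the helper name `expand_amount` and the reading of "k" as "thousand" are assumptions made for illustration.

```python
import re
from typing import Optional

# Assumption: a trailing "k" means "thousand". Adjust for your own data.
_AMOUNT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*k\b", re.IGNORECASE)


def expand_amount(text: str) -> Optional[int]:
    """Return the first '<number>k' amount in text as a full integer, or None."""
    match = _AMOUNT_RE.search(text)
    if match is None:
        return None
    return int(round(float(match.group(1)) * 1_000))


print(expand_amount("Selling red t-shirt for 50k"))  # 50000
print(expand_amount("Selling red t-shirt"))          # None
```

Keeping this step outside the schema leaves `Product.price` as a plain `int` and makes the rescaling rule explicit and easy to change.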


## 2. Working with Tabular Data (Polars)

Process entire DataFrames with automatic batch processing:

In [3]:
# Create DataFrame with messy data
df = pl.DataFrame({
    "weight": ["5kg", "3.5 kg", "5000g", "2.2kg"]
})

print("Input Data:")
print(df)

# Clean the entire column
result = loclean.clean(
    df,
    target_col="weight",
    instruction="Convert all weights to kg"
)

# View results
print("\nCleaned Results:")
print(result.select(["weight", "clean_value", "clean_unit"]))

Input Data:
shape: (4, 1)
┌────────┐
│ weight │
│ ---    │
│ str    │
╞════════╡
│ 5kg    │
│ 3.5 kg │
│ 5000g  │
│ 2.2kg  │
└────────┘


Inference Batches: 100%|██████████| 1/1 [00:00<00:00, 2139.95batch/s]


Cleaned Results:
shape: (4, 3)
┌────────┬─────────────┬────────────┐
│ weight ┆ clean_value ┆ clean_unit │
│ ---    ┆ ---         ┆ ---        │
│ str    ┆ f64         ┆ str        │
╞════════╪═════════════╪════════════╡
│ 5kg    ┆ 5.0         ┆ kg         │
│ 3.5 kg ┆ 3.5         ┆ kg         │
│ 5000g  ┆ 5.0         ┆ kg         │
│ 2.2kg  ┆ 2.2         ┆ kg         │
└────────┴─────────────┴────────────┘
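
Because the cleaned values come from a language model, it can be worth cross-checking a sample against a deterministic rule. The following is a minimal sketch in plain Python, not Loclean's implementation; the function name `to_kg` and the supported units (`kg` and `g` only) are assumptions chosen to match the four inputs above.

```python
import re

# Divisors that convert each supported unit to kilograms (illustrative only).
_PER_KG = {"kg": 1, "g": 1000}

_WEIGHT_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(kg|g)\s*$", re.IGNORECASE)


def to_kg(raw: str) -> tuple[float, str]:
    """Parse strings such as '3.5 kg' or '5000g' and return (value_in_kg, 'kg')."""
    match = _WEIGHT_RE.match(raw)
    if match is None:
        raise ValueError(f"Unrecognized weight: {raw!r}")
    value = float(match.group(1))
    unit = match.group(2).lower()
    return value / _PER_KG[unit], "kg"


for raw in ["5kg", "3.5 kg", "5000g", "2.2kg"]:
    print(raw, "->", to_kg(raw))
```

For these inputs the sketch yields 5.0, 3.5, 5.0, and 2.2 kg, which agrees with the `clean_value` column shown above.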





## 3. Working with Pandas

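Loclean also accepts Pandas DataFrames. `extract` returns a Pandas DataFrame with an added `description_extracted` column that holds the extracted fields: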

In [4]:
# Create Pandas DataFrame
df_pandas = pd.DataFrame({
    "description": ["Selling red t-shirt for 50k"]
})

# Extract structured data
result = loclean.extract(df_pandas, schema=Product, target_col="description")
print(f"Result type: {type(result)}")
print(result)

Result type: <class 'pandas.core.frame.DataFrame'>
                   description  \
0  Selling red t-shirt for 50k   

                               description_extracted  
0  {'name': 'red t-shirt', 'price': 50, 'color': ...  


## 4. Privacy Scrubbing

Mask personally identifiable information (PII) locally:

In [5]:
# Text with PII
text = "Contact John Doe at john@example.com or call 555-1234"

# Scrub PII (default: mask mode)
cleaned = loclean.scrub(text, mode="mask")
print(f"Original: {text}")
print(f"Cleaned:  {cleaned}")

Original: Contact John Doe at john@example.com or call 555-1234
Cleaned:  Contact [PERSON] at [EMAIL] or call [PHONE]
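
To make the idea of mask mode concrete, here is a rough regex-only sketch of placeholder substitution. It is not how Loclean detects PII: the patterns and the name `mask_basic_pii` are assumptions, the phone pattern only covers the `555-1234` shape used above, and a regex cannot reliably find personal names, so `John Doe` is left untouched.

```python
import re

# Deliberately simple patterns, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b\d{3}-\d{4}\b")


def mask_basic_pii(text: str) -> str:
    """Replace e-mail addresses and NNN-NNNN phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


print(mask_basic_pii("Contact John Doe at john@example.com or call 555-1234"))
# Contact John Doe at [EMAIL] or call [PHONE]
```

The gap between this output and the `[PERSON]` placeholder in the Loclean result shows why pattern matching alone is not enough for names.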


## 5. Extraction with DataFrames (Polars)

Extract structured data from DataFrame columns:

In [6]:
df = pl.DataFrame({
    "description": [
        "Selling red t-shirt for 50k",
        "Blue jeans available for 30k"
    ]
})

result = loclean.extract(df, schema=Product, target_col="description")

# Show extracted data with expanded struct fields for better readability
print("Extracted Data:")
print(result.with_columns([
    pl.col("description_extracted").struct.field("name").alias("product_name"),
    pl.col("description_extracted").struct.field("price").alias("product_price"),
    pl.col("description_extracted").struct.field("color").alias("product_color"),
]))

# Query extracted data using Polars Struct
# Note: "50k" is extracted as 50, not 50000
filtered = result.filter(
    pl.col("description_extracted").struct.field("price") > 40
)
print("\nProducts with price > 40:")
print(filtered.with_columns([
    pl.col("description_extracted").struct.field("name").alias("product_name"),
    pl.col("description_extracted").struct.field("price").alias("product_price"),
    pl.col("description_extracted").struct.field("color").alias("product_color"),
]))

Extracted Data:
shape: (2, 5)
┌─────────────────────────┬─────────────────────────┬──────────────┬───────────────┬───────────────┐
│ description             ┆ description_extracted   ┆ product_name ┆ product_price ┆ product_color │
│ ---                     ┆ ---                     ┆ ---          ┆ ---           ┆ ---           │
│ str                     ┆ struct[3]               ┆ str          ┆ i64           ┆ str           │
╞═════════════════════════╪═════════════════════════╪══════════════╪═══════════════╪═══════════════╡
│ Selling red t-shirt for …
[output truncated]

## Best Practices

### Tips for Better Results

1. **Use appropriate backends**: Polars is generally faster on large datasets; Pandas offers broader ecosystem compatibility
2. **Batch processing**: DataFrames are automatically batched for efficient inference
3. **Custom instructions**: Provide clear instructions for better extraction/cleaning results
4. **Schema design**: Use Pydantic models with appropriate types for structured extraction
5. **Privacy first**: Always scrub PII before sharing or storing data
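
Tip 2 can be pictured as splitting a column's values into fixed-size chunks and sending each chunk to the model in turn. The sketch below only illustrates that idea in plain Python. Loclean's actual batch size and batching strategy are not shown in this notebook, so `iter_batches` and `batch_size=3` are hypothetical.

```python
from typing import Iterator, List, Sequence


def iter_batches(values: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of values, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(values), batch_size):
        yield list(values[start:start + batch_size])


weights = ["5kg", "3.5 kg", "5000g", "2.2kg"]
batches = list(iter_batches(weights, batch_size=3))
print(batches)
# [['5kg', '3.5 kg', '5000g'], ['2.2kg']]
```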

## Next Steps

- **Data Cleaning:** See [02-data-cleaning.ipynb](./02-data-cleaning.ipynb) for detailed cleaning examples
- **Privacy Scrubbing:** See [03-privacy-scrubbing.ipynb](./03-privacy-scrubbing.ipynb) for PII removal
- **Structured Extraction:** See [04-structured-extraction.ipynb](./04-structured-extraction.ipynb) for complex schemas
- **Full Documentation:** [https://nxank4.github.io/loclean](https://nxank4.github.io/loclean)