# Structured Extraction with Loclean

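This notebook demonstrates how to extract structured data from unstructured text with Loclean. You describe the target shape as a Pydantic schema, Loclean runs the extraction through a model served by Ollama, and every result conforms to that schema.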

> **⚡ Zero Setup:** Loclean auto-starts the Ollama daemon and auto-pulls models on first use.

> **📚 Full Documentation:** [Structured Extraction Guide](https://nxank4.github.io/loclean/guides/extraction/)

In [None]:
!pip install loclean

In [12]:
from typing import List, Optional

import polars as pl
from pydantic import BaseModel

import loclean

## Basic Example

Extract structured data from unstructured text:

In [13]:
class Product(BaseModel):
    name: str
    price: int
    color: str


# Extract from text
item = loclean.extract("Selling red t-shirt for 50k", schema=Product)
print(f"Name: {item.name}")
print(f"Price: {item.price}")
print(f"Color: {item.color}")

Name: red t-shirt
Price: 50
Color: red
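To make the idea of schema compliance concrete, the following sketch uses only Pydantic (no Loclean or Ollama call) to show what checking a raw JSON reply against `Product` looks like. It assumes Pydantic v2, and the two JSON strings are made-up stand-ins for model output. How Loclean enforces the schema internally is not shown here.

```python
from pydantic import BaseModel, ValidationError


class Product(BaseModel):
    name: str
    price: int
    color: str


# Hypothetical raw replies; illustrative strings, not real model output.
valid_reply = '{"name": "red t-shirt", "price": 50, "color": "red"}'
incomplete_reply = '{"name": "red t-shirt", "price": 50}'

# A reply that matches the schema parses into a typed Product instance.
item = Product.model_validate_json(valid_reply)
print(item)

# A reply with a missing required field raises ValidationError.
try:
    Product.model_validate_json(incomplete_reply)
    missing_fields = []
except ValidationError as exc:
    missing_fields = [error["loc"][0] for error in exc.errors()]

print(f"Missing fields: {missing_fields}")
```

Note that validation only checks shape and types. A value such as `price=50` for "50k" is schema-valid even though it may not be the number you wanted.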


## Working with DataFrames

Extract structured data from DataFrame columns:

In [14]:
df = pl.DataFrame(
    {"description": ["Selling red t-shirt for 50k", "Blue jeans available for 30k"]}
)

result = loclean.extract(df, schema=Product, target_col="description")

# Expand the struct fields into separate columns for better readability.
# The same expressions are reused for both printouts below.
expanded_fields = [
    pl.col("description_extracted").struct.field(field).alias(f"product_{field}")
    for field in ("name", "price", "color")
]

print("Extracted Data:")
print(result.with_columns(expanded_fields))

# Query extracted data using Polars Struct
# Note: "50k" is extracted as 50, not 50000
filtered = result.filter(pl.col("description_extracted").struct.field("price") > 40)
print("\nProducts with price > 40:")
print(filtered.with_columns(expanded_fields))

Extracted Data:
shape: (2, 5)
┌─────────────────────────┬─────────────────────────┬──────────────┬───────────────┬───────────────┐
│ description             ┆ description_extracted   ┆ product_name ┆ product_price ┆ product_color │
│ ---                     ┆ ---                     ┆ ---          ┆ ---           ┆ ---           │
│ str                     ┆ struct[3]               ┆ str          ┆ i64           ┆ str           │
╞═════════════════════════╪═════════════════════════╪══════════════╪═══════════════╪═══════════════╡
│ Selling red t-shirt for ┆ {"red                   ┆ red t-shirt  ┆ 50            ┆ red           │
│ 50k                     ┆ t-shirt",50,"red"}      ┆              ┆               ┆               │
│ Blue jeans available    ┆ {"Blue                  ┆ Blue jeans   ┆ 30            ┆ Blue          │
│ for 30k                 ┆ jeans",30,"Blue"}       ┆              ┆               ┆               │
└─────────────────────────┴─────────────────────────┴────────

## Advanced Features

### Nested Schemas

Extract nested data structures using nested Pydantic models:

In [None]:
class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str


class Person(BaseModel):
    name: str
    age: int
    email: str
    address: Address  # Nested schema
    phone_numbers: List[str]  # List of strings
    notes: Optional[str] = None  # Optional field


text = """
John Doe, age 35, email: john@example.com
Lives at 123 Main St, New York, NY 10001
Phones: 555-1234, 555-5678
Notes: Preferred contact method is email
"""

person = loclean.extract(text, schema=Person)
print(f"Name: {person.name}")
print(f"Age: {person.age}")
print(f"Email: {person.email}")
print(
    f"Address: {person.address.street}, {person.address.city}, "
    f"{person.address.state} {person.address.zip_code}"
)
print(f"Phone Numbers: {person.phone_numbers}")
print(f"Notes: {person.notes}")

Name: John Doe
Age: 35
Email: john@example.com
Address: 123 Main St, New York, NY 10001
Phone Numbers: ['555-1234', '555-5678']
Notes: Preferred contact method is email
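To see how a nested model is described to a schema-aware consumer, you can inspect the JSON Schema that Pydantic generates for `Person`. This sketch assumes Pydantic v2. Whether Loclean sends exactly this schema to the model is an assumption and is not confirmed by this notebook.

```python
from typing import List, Optional

from pydantic import BaseModel


class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str


class Person(BaseModel):
    name: str
    age: int
    email: str
    address: Address
    phone_numbers: List[str]
    notes: Optional[str] = None


schema = Person.model_json_schema()

# The nested model is defined once under "$defs" and referenced by "$ref".
address_ref = schema["properties"]["address"]["$ref"]

# Fields with a default (here: notes) are left out of "required".
required = schema["required"]

print(address_ref)
print(required)
```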


### Custom Instructions

Provide custom instructions to guide the extraction:

In [None]:
# Custom instruction to extract price in actual currency units
class ProductWithPrice(BaseModel):
    name: str
    price: int  # Price in actual currency units (not thousands)
    color: str


text = "Selling red t-shirt for 50k"
item = loclean.extract(
    text,
    schema=ProductWithPrice,
    instruction=(
        "Extract the product name (e.g., 'red t-shirt'), "
        "price in actual currency units ('50k' means 50000, not 50), "
        "and color."
    ),
)
print(f"Name: {item.name}")
print(f"Price: {item.price}")  # Should be 50000 with custom instruction
print(f"Color: {item.color}")

Name: red t-shirt
Price: 50000
Color: red
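Instructions steer the model but do not guarantee a particular reading of "50k". A deterministic complement, sketched here as a different technique from the instruction-based approach above, is a Pydantic `field_validator` that expands `k` and `m` suffixes before the integer check runs. The class name `ProductNormalized` and the suffix table are hypothetical and chosen for illustration. The sketch assumes Pydantic v2 and only helps when the raw value reaches the validator as a string such as `"50k"`.

```python
import re

from pydantic import BaseModel, field_validator

# Illustrative suffix table; extend it to match your data.
_SUFFIXES = {"k": 1_000, "m": 1_000_000}


class ProductNormalized(BaseModel):
    name: str
    price: int
    color: str

    @field_validator("price", mode="before")
    @classmethod
    def expand_suffix(cls, value):
        """Turn strings like '50k' or '1.5M' into plain integers."""
        if isinstance(value, str):
            match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([kKmM])\s*", value)
            if match:
                number, suffix = match.groups()
                return int(float(number) * _SUFFIXES[suffix.lower()])
        # Anything else falls through to Pydantic's normal int validation.
        return value


with_suffix = ProductNormalized(name="red t-shirt", price="50k", color="red")
plain = ProductNormalized(name="red t-shirt", price=50, color="red")
print(with_suffix.price, plain.price)
```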


### Output Types for DataFrames

Choose between a structured dict (the default, which is faster) and Pydantic model instances:

In [None]:
df = pl.DataFrame({"description": ["Selling red t-shirt for 50k"]})

# Default: output_type="dict" (Polars Struct - faster, vectorized)
result_dict = loclean.extract(
    df, schema=Product, target_col="description", output_type="dict"
)
print("Output type: dict (Polars Struct)")
print(f"Type of first element: {type(result_dict['description_extracted'][0])}")
print(f"Value: {result_dict['description_extracted'][0]}")
print(result_dict)

print("\n" + "=" * 60 + "\n")

# Alternative: output_type="pydantic" (Pydantic instances -
# slower, breaks vectorization)
result_pydantic = loclean.extract(
    df, schema=Product, target_col="description", output_type="pydantic"
)
print("Output type: pydantic (Pydantic model instances)")
print(f"Type of first element: {type(result_pydantic['description_extracted'][0])}")
# Access the actual Pydantic model instance
if hasattr(result_pydantic["description_extracted"][0], "name"):
    attr_name = result_pydantic["description_extracted"][0].name
    print(f"Accessing model attribute: name = {attr_name}")
print(result_pydantic)

Output type: dict (Polars Struct)
Type of first element: <class 'dict'>
Value: {'name': 'red t-shirt', 'price': 50, 'color': 'red'}
shape: (1, 2)
┌─────────────────────────────┬──────────────────────────┐
│ description                 ┆ description_extracted    │
│ ---                         ┆ ---                      │
│ str                         ┆ struct[3]                │
╞═════════════════════════════╪══════════════════════════╡
│ Selling red t-shirt for 50k ┆ {"red t-shirt",50,"red"} │
└─────────────────────────────┴──────────────────────────┘


Output type: pydantic (Pydantic model instances)
Type of first element: <class 'dict'>
shape: (1, 2)
┌─────────────────────────────┬──────────────────────────┐
│ description                 ┆ description_extracted    │
│ ---                         ┆ ---                      │
│ str                         ┆ struct[3]                │
╞═════════════════════════════╪══════════════════════════╡
│ Selling red t-shirt for 50k ┆ {"red t-shi

### Optional Fields

Optional fields are handled gracefully: a field declared as `Optional[T] = None` is set to `None` when the text does not mention it:

In [None]:
class ProductWithOptional(BaseModel):
    name: str
    price: int
    color: str
    discount: Optional[int] = None  # Optional field
    description: Optional[str] = None  # Optional field


# Text without optional fields
text1 = "Selling red t-shirt for 50k"
item1 = loclean.extract(text1, schema=ProductWithOptional)
print("Without optional fields:")
print(
    f"Name: {item1.name}, Price: {item1.price}, "
    f"Discount: {item1.discount}, Description: {item1.description}"
)

# Text with optional fields
text2 = "Selling red t-shirt for 50k, 10% discount, premium quality"
item2 = loclean.extract(text2, schema=ProductWithOptional)
print("\nWith optional fields:")
print(
    f"Name: {item2.name}, Price: {item2.price}, "
    f"Discount: {item2.discount}, Description: {item2.description}"
)

Without optional fields:
Name: red t-shirt, Price: 50, Discount: None, Description: None

With optional fields:
Name: red t-shirt, Price: 50, Discount: 5, Description: premium quality
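In the recorded output above, the second text says "10% discount" while the extracted `discount` is 5. The value is schema-valid but does not match the source. A cheap sanity check, sketched below with the standard library only, is to flag integer fields whose value never appears as a number in the input text. The helper names are hypothetical, and the dict mirrors what `item2.model_dump()` would contain for the output shown.

```python
import re


def numbers_in(text: str) -> set[int]:
    """Return every run of digits in the text as an integer."""
    return {int(token) for token in re.findall(r"\d+", text)}


def unsupported_int_fields(text: str, extracted: dict) -> list[str]:
    """List integer fields whose value never appears as a number in the text."""
    present = numbers_in(text)
    return [
        field
        for field, value in extracted.items()
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, int)
        and not isinstance(value, bool)
        and value not in present
    ]


text = "Selling red t-shirt for 50k, 10% discount, premium quality"

# Values copied from the recorded output above.
extracted = {
    "name": "red t-shirt",
    "price": 50,
    "color": "red",
    "discount": 5,
    "description": "premium quality",
}

flagged = unsupported_int_fields(text, extracted)
print(f"Fields to review: {flagged}")
```

The check is deliberately naive: it would also flag a price of 50000 derived from "50k", so treat a flagged field as a prompt for review and not as proof of an error.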


## Best Practices

### Tips for Better Extraction

1. **Be specific with instructions**: State units and edge cases explicitly (for example, "'50k' means 50000, not 50") so the LLM does not have to guess
2. **Use appropriate types**: Choose `int`, `float`, `str`, `bool` based on your data
3. **Handle optional fields**: Use `Optional[T]` for fields that may not always be present
4. **Nested schemas**: Break down complex data into nested models for better structure
5. **Output types**: Use `"dict"` (default) for performance, `"pydantic"` when you need model methods
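The type tip above matters more than it may seem. As a small illustration, assuming Pydantic v2, an `int` field rejects a fractional number outright, while a `float` field accepts it. The two model names are hypothetical.

```python
from pydantic import BaseModel, ValidationError


class IntPrice(BaseModel):
    price: int


class FloatPrice(BaseModel):
    price: float


# Pydantic v2 refuses to silently truncate 49.99 into an int.
try:
    IntPrice(price=49.99)
    int_accepts_fraction = True
except ValidationError:
    int_accepts_fraction = False

# The float field keeps the fractional part.
float_price = FloatPrice(price=49.99).price

print(int_accepts_fraction, float_price)
```

If your source text can contain prices such as "49.99", declaring the field as `float` avoids validation failures on otherwise correct extractions.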

## Next Steps

- **Quick Start:** See [01-quick-start.ipynb](./01-quick-start.ipynb) for an overview of all features
- **Data Cleaning:** See [02-data-cleaning.ipynb](./02-data-cleaning.ipynb) for data normalization
- **Privacy Scrubbing:** See [03-privacy-scrubbing.ipynb](./03-privacy-scrubbing.ipynb) for PII removal
- **Full Documentation:** [https://nxank4.github.io/loclean](https://nxank4.github.io/loclean)