# Data Cleaning with Loclean

This notebook demonstrates how to clean and normalize messy data using `loclean.clean()`.

> **⚡ Zero Setup:** Loclean auto-starts the Ollama daemon and auto-pulls models on first use.

> **📚 Full Documentation:** [Data Cleaning Guide](https://nxank4.github.io/loclean/getting-started/data-cleaning/)

In [None]:
%pip install loclean

In [1]:
import pandas as pd
import polars as pl

import loclean

## Basic Usage

Clean messy data in DataFrame columns:

In [None]:
# Create a DataFrame with messy data
df = pl.DataFrame({"weight": ["5kg", "3.5 kg", "5000g", "2.2kg"]})
print("Input Data:")
print(df)

# Clean the weight column, extracting values as-is (no unit conversion).
# For example, "5000g" stays 5000.0 g rather than becoming 5.0 kg.
result = loclean.clean(
    df, target_col="weight", instruction="Extract the numeric value and unit as-is."
)

print("\nCleaned Results:")
print(result.select(["weight", "clean_value", "clean_unit"]))
print(
    "\nNote: Units are preserved as-is. See next section for unit conversion examples."
)

## Custom Instructions

The `instruction` argument controls what `clean()` extracts and whether it converts values. The same input column yields different results under different instructions:

In [3]:
# Example 1: Unit Conversion
# Convert all weights to the same unit (kg)
df_weight = pl.DataFrame({"weight": ["5kg", "3.5 kg", "5000g", "2.2kg"]})

result_converted = loclean.clean(
    df_weight, target_col="weight", instruction="Convert all weights to kg"
)

print("With unit conversion (all to kg):")
print(result_converted.select(["weight", "clean_value", "clean_unit"]))

print("\n" + "=" * 60 + "\n")

# Example 2: Extract price with currency
df_price = pl.DataFrame({"price": ["$50", "50 USD", "€45", "100 dollars"]})

result = loclean.clean(
    df_price,
    target_col="price",
    instruction="Extract the numeric value and currency code (USD, EUR, etc.)",
)

print("Extract price with currency:")
print(result.select(["price", "clean_value", "clean_unit"]))

Inference Batches: 100%|██████████| 1/1 [00:00<00:00, 2566.89batch/s]


With unit conversion (all to kg):
shape: (4, 3)
┌────────┬─────────────┬────────────┐
│ weight ┆ clean_value ┆ clean_unit │
│ ---    ┆ ---         ┆ ---        │
│ str    ┆ f64         ┆ str        │
╞════════╪═════════════╪════════════╡
│ 5kg    ┆ 5.0         ┆ kg         │
│ 3.5 kg ┆ 3.5         ┆ kg         │
│ 5000g  ┆ 5.0         ┆ kg         │
│ 2.2kg  ┆ 2.2         ┆ kg         │
└────────┴─────────────┴────────────┘




Inference Batches: 100%|██████████| 1/1 [00:00<00:00, 2634.61batch/s]

Extract price with currency:
shape: (4, 3)
┌─────────────┬─────────────┬────────────┐
│ price       ┆ clean_value ┆ clean_unit │
│ ---         ┆ ---         ┆ ---        │
│ str         ┆ f64         ┆ str        │
╞═════════════╪═════════════╪════════════╡
│ $50         ┆ 50.0        ┆ USD        │
│ 50 USD      ┆ 50.0        ┆ USD        │
│ €45         ┆ 45.0        ┆ EUR        │
│ 100 dollars ┆ 100.0       ┆ USD        │
└─────────────┴─────────────┴────────────┘
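
Because the conversion is performed by an LLM, it can be worth cross-checking a converted column against a simple deterministic baseline. The sketch below is not part of Loclean: `baseline_kg`, `find_mismatches`, and the `TO_KG` table are hypothetical helpers, and the rows are copied by hand from the weight output above (with Polars, `result.to_dicts()` yields rows in this shape).

```python
import math
import re

# Conversion factors to kilograms for the units used in this notebook (assumed list).
TO_KG = {"kg": 1.0, "g": 0.001}

# Matches a number followed by a unit, e.g. "5kg", "3.5 kg", "5000g".
_PATTERN = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*([a-zA-Z]+)\s*$")


def baseline_kg(text):
    """Parse a weight string and return kilograms, or None if it cannot be parsed."""
    match = _PATTERN.match(text or "")
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2).lower()
    factor = TO_KG.get(unit)
    return None if factor is None else value * factor


def find_mismatches(rows, source_col="weight"):
    """Return rows whose clean_value disagrees with the regex baseline."""
    bad = []
    for row in rows:
        expected = baseline_kg(row[source_col])
        if expected is None:
            continue  # the baseline cannot parse this row, so there is nothing to compare
        actual = row["clean_value"]
        if actual is None or not math.isclose(actual, expected, rel_tol=1e-9):
            bad.append(row)
    return bad


# Rows copied from the "Convert all weights to kg" output above.
rows = [
    {"weight": "5kg", "clean_value": 5.0, "clean_unit": "kg"},
    {"weight": "3.5 kg", "clean_value": 3.5, "clean_unit": "kg"},
    {"weight": "5000g", "clean_value": 5.0, "clean_unit": "kg"},
    {"weight": "2.2kg", "clean_value": 2.2, "clean_unit": "kg"},
]

print(find_mismatches(rows))  # prints []
```

A regex baseline only covers the formats it was written for, so rows it cannot parse are skipped rather than flagged. It is a spot check on the LLM output, not a replacement for it.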





## Working with Different Backends

### Pandas

In [4]:
# Clean with Pandas DataFrame
df_pandas = pd.DataFrame({"temperature": ["25°C", "77F", "298K"]})

result = loclean.clean(
    df_pandas,
    target_col="temperature",
    instruction="Extract temperature value and unit",
)

print(f"Result type: {type(result)}")
# Pandas: use column selection with list
print(result[["temperature", "clean_value", "clean_unit"]])

Inference Batches: 100%|██████████| 1/1 [00:00<00:00, 2989.53batch/s]

Result type: <class 'pandas.core.frame.DataFrame'>
  temperature  clean_value clean_unit
0        25°C           25         °C
1         77F           77          F
2        298K          298          K





### Polars

In [5]:
# Clean with Polars DataFrame
df_polars = pl.DataFrame({"distance": ["5km", "3 miles", "1000m"]})

result = loclean.clean(
    df_polars, target_col="distance", instruction="Extract distance value and unit"
)

print(f"Result type: {type(result)}")
print(result.select(["distance", "clean_value", "clean_unit"]))

Inference Batches: 100%|██████████| 1/1 [00:00<00:00, 4249.55batch/s]


Result type: <class 'polars.dataframe.frame.DataFrame'>
shape: (3, 3)
┌──────────┬─────────────┬────────────┐
│ distance ┆ clean_value ┆ clean_unit │
│ ---      ┆ ---         ┆ ---        │
│ str      ┆ f64         ┆ str        │
╞══════════╪═════════════╪════════════╡
│ 5km      ┆ 5.0         ┆ km         │
│ 3 miles  ┆ 3.0         ┆ miles      │
│ 1000m    ┆ 1000.0      ┆ m          │
└──────────┴─────────────┴────────────┘


## Handling Missing Values

`clean()` passes missing inputs through instead of failing. `None` and empty strings produce null in every output column (`clean_value`, `clean_unit`, and `clean_reasoning`):

In [None]:
df_with_nulls = pl.DataFrame({"weight": ["5kg", None, "3kg", ""]})
result = loclean.clean(
    df_with_nulls, target_col="weight", instruction="Extract weight value and unit"
)
print(
    "Note: Missing values (None, empty strings) result in None "
    "for clean_value, clean_unit, and clean_reasoning"
)
print(result.select(["weight", "clean_value", "clean_unit"]))

Inference Batches: 100%|██████████| 1/1 [00:00<00:00, 4957.81batch/s]

Note: Missing values (None, empty strings) result in None for clean_value, clean_unit, and clean_reasoning
shape: (4, 3)
┌────────┬─────────────┬────────────┐
│ weight ┆ clean_value ┆ clean_unit │
│ ---    ┆ ---         ┆ ---        │
│ str    ┆ f64         ┆ str        │
╞════════╪═════════════╪════════════╡
│ 5kg    ┆ 5.0         ┆ kg         │
│ null   ┆ null        ┆ null       │
│ 3kg    ┆ 3.0         ┆ kg         │
│        ┆ null        ┆ null       │
└────────┴─────────────┴────────────┘
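
When reviewing results, it can help to separate rows that are null because the input was missing from rows that had input but no extracted value. The sketch below is a hypothetical helper, not a Loclean feature. It assumes that a null `clean_value` for non-empty input means the extraction did not succeed, which the source does not confirm. The rows are copied by hand from the output above (with Polars, `result.to_dicts()` yields rows in this shape).

```python
def split_by_status(rows, source_col="weight"):
    """Group rows into cleaned, missing-input, and unparsed buckets."""
    buckets = {"cleaned": [], "missing_input": [], "unparsed": []}
    for row in rows:
        raw = row[source_col]
        if raw is None or raw.strip() == "":
            # The input itself was None or empty, so null outputs are expected.
            buckets["missing_input"].append(row)
        elif row["clean_value"] is None:
            # Assumption: non-empty input with a null output needs manual review.
            buckets["unparsed"].append(row)
        else:
            buckets["cleaned"].append(row)
    return buckets


# Rows copied from the missing-values output above.
rows = [
    {"weight": "5kg", "clean_value": 5.0, "clean_unit": "kg"},
    {"weight": None, "clean_value": None, "clean_unit": None},
    {"weight": "3kg", "clean_value": 3.0, "clean_unit": "kg"},
    {"weight": "", "clean_value": None, "clean_unit": None},
]

buckets = split_by_status(rows)
print({name: len(group) for name, group in buckets.items()})
# prints {'cleaned': 2, 'missing_input': 2, 'unparsed': 0}
```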





## Best Practices

### Tips for Better Cleaning

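1. **Be specific with instructions**: Name the fields and format you want, e.g. "Extract the numeric value and currency code (USD, EUR, etc.)", so the LLM does not have to guess
2. **Unit conversion**: Name the target unit in the instruction, e.g. "Convert all weights to kg", to standardize mixed units
3. **Handle missing values**: `clean()` returns null outputs for `None` and empty strings, as shown in the missing-values example
4. **Backend choice**: Polars is generally faster on large datasets, while Pandas fits more easily into existing tooling; in the examples above, `clean()` returns the same DataFrame type it receives
5. **Batch processing**: `clean()` batches rows for inference automatically, as the `Inference Batches` progress bar in the outputs shows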

## Next Steps

- **Quick Start:** See [01-quick-start.ipynb](./01-quick-start.ipynb) for an overview of all features
- **Privacy Scrubbing:** See [03-privacy-scrubbing.ipynb](./03-privacy-scrubbing.ipynb) for PII removal
- **Structured Extraction:** See [04-structured-extraction.ipynb](./04-structured-extraction.ipynb) for complex schemas
- **Full Documentation:** [https://nxank4.github.io/loclean](https://nxank4.github.io/loclean)