# Data Cleaning with Loclean

This notebook demonstrates how to clean and normalize messy data using `loclean.clean()`.

> **📚 Full Documentation:** [Data Cleaning Guide](https://nxank4.github.io/loclean/getting-started/data-cleaning/)

In [None]:
import loclean
import polars as pl
import pandas as pd

## Basic Usage

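Pass a DataFrame, the column to clean, and a natural-language instruction. The result keeps the original column alongside the extracted `weight_clean_value` and `weight_clean_unit` columns: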

In [None]:
# Create a DataFrame with messy data
df = pl.DataFrame({
    "weight": ["5kg", "3.5 kg", "5000g", "2.2kg"]
})

print("Input Data:")
print(df)

# Clean the weight column
result = loclean.clean(
    df,
    target_col="weight",
    instruction="Extract the numeric value and unit as-is."
)

print("\nCleaned Results:")
print(result.select(["weight", "weight_clean_value", "weight_clean_unit"]))
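
Because the instruction above keeps each unit as-is, the extracted values are not directly comparable (`5000g` and `5kg` describe the same weight). The following is a minimal sketch of one way to normalize them afterwards. It does not call `loclean`; the rows are hand-written stand-ins shaped like the selected columns, and the conversion table and the `to_kilograms` helper are assumptions made for illustration.

```python
# Hypothetical rows shaped like the columns selected above.
# The values are illustrative, not actual loclean output.
rows = [
    {"weight": "5kg", "weight_clean_value": 5.0, "weight_clean_unit": "kg"},
    {"weight": "3.5 kg", "weight_clean_value": 3.5, "weight_clean_unit": "kg"},
    {"weight": "5000g", "weight_clean_value": 5000.0, "weight_clean_unit": "g"},
    {"weight": "2.2kg", "weight_clean_value": 2.2, "weight_clean_unit": "kg"},
]

# How many of each unit make one kilogram (assumed unit spellings).
UNITS_PER_KG = {"kg": 1, "g": 1000}


def to_kilograms(value, unit):
    """Return the weight in kilograms, or None if the value or unit is missing or unknown."""
    if value is None or unit is None:
        return None
    per_kg = UNITS_PER_KG.get(unit.strip().lower())
    return None if per_kg is None else value / per_kg


weights_kg = [
    to_kilograms(r["weight_clean_value"], r["weight_clean_unit"]) for r in rows
]
print(weights_kg)
```

With a Polars result, `result.to_dicts()` yields rows in this shape, so the same helper could be applied to real output.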

## Custom Instructions

Provide custom instructions to guide the extraction:

In [None]:
# Extract price with currency
df_price = pl.DataFrame({
    "price": ["$50", "50 USD", "€45", "100 dollars"]
})

result = loclean.clean(
    df_price,
    target_col="price",
    instruction="Extract the numeric value and currency code (USD, EUR, etc.)"
)

print(result.select(["price", "price_clean_value", "price_clean_unit"]))
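
Since the instruction asks for a currency code, it can be worth checking that the unit column actually contains one before using it downstream. This is a minimal sketch of such a check. The rows are hand-written and hypothetical (the last one is deliberately left un-normalized to show a failure), and the three-uppercase-letter pattern is an assumed stand-in for a real currency code list.

```python
import re

# Hypothetical rows shaped like the columns selected above.
# The values are illustrative, not actual loclean output.
rows = [
    {"price": "$50", "price_clean_value": 50.0, "price_clean_unit": "USD"},
    {"price": "50 USD", "price_clean_value": 50.0, "price_clean_unit": "USD"},
    {"price": "€45", "price_clean_value": 45.0, "price_clean_unit": "EUR"},
    # Deliberately un-normalized unit, to show what the check catches.
    {"price": "100 dollars", "price_clean_value": 100.0, "price_clean_unit": "dollars"},
]

# Assumption: a valid code is exactly three uppercase ASCII letters.
CODE_PATTERN = re.compile(r"[A-Z]{3}")


def has_currency_code(unit):
    """Return True if unit looks like a three-letter currency code."""
    return isinstance(unit, str) and CODE_PATTERN.fullmatch(unit) is not None


# Collect the original inputs whose extracted unit needs a second look.
needs_review = [r["price"] for r in rows if not has_currency_code(r["price_clean_unit"])]
print(needs_review)
```

Rows that end up in `needs_review` are candidates for a more specific instruction or a manual fix.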

## Working with Different Backends

### Pandas

In [None]:
df_pandas = pd.DataFrame({
    "temperature": ["25°C", "77F", "298K"]
})

result = loclean.clean(
    df_pandas,
    target_col="temperature",
    instruction="Extract temperature value and unit"
)

print(f"Result type: {type(result)}")
print(result)

### Polars

In [None]:
df_polars = pl.DataFrame({
    "distance": ["5km", "3 miles", "1000m"]
})

result = loclean.clean(
    df_polars,
    target_col="distance",
    instruction="Extract distance value and unit"
)

print(f"Result type: {type(result)}")
print(result)

## Handling Missing Values

`clean()` does not fail on missing values. Rows with a missing input get `None` in the cleaned value, unit, and reasoning columns:

In [None]:
df_with_nulls = pl.DataFrame({
    "weight": ["5kg", None, "3kg", ""]
})

result = loclean.clean(
    df_with_nulls,
    target_col="weight",
    instruction="Extract weight value and unit"
)

print("Note: Missing values result in None for clean_value, clean_unit, and clean_reasoning")
print(result)
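
After cleaning, it is often useful to separate the rows that produced a value from those that did not. The sketch below shows one way to do that on plain Python rows. It does not call `loclean`; the rows are hypothetical, and treating the empty string like a null (a `None` cleaned value) is an assumption, not confirmed behavior.

```python
# Hypothetical rows shaped like the result above.
# Assumption: the empty-string input is treated like a missing value.
rows = [
    {"weight": "5kg", "weight_clean_value": 5.0, "weight_clean_unit": "kg"},
    {"weight": None, "weight_clean_value": None, "weight_clean_unit": None},
    {"weight": "3kg", "weight_clean_value": 3.0, "weight_clean_unit": "kg"},
    {"weight": "", "weight_clean_value": None, "weight_clean_unit": None},
]

# Split on whether a cleaned value was produced.
cleaned = [r for r in rows if r["weight_clean_value"] is not None]
missing = [r for r in rows if r["weight_clean_value"] is None]

print(f"{len(cleaned)} cleaned, {len(missing)} missing")
for r in missing:
    print("No value extracted for input:", repr(r["weight"]))
```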