# 5. Narwhals Data Validation Patterns

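This notebook demonstrates essential patterns for validating data in AI/ML pipelines using Narwhals. The table below contrasts backend-agnostic Narwhals idioms with the pandas-specific calls they replace, so that the same checks can run on training and inference data regardless of backend: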

| Task | ✅ Good Pattern (Backend-Agnostic) | ❌ Bad Pattern (Backend-Specific) |
|------|-----------------------------------|----------------------------------|
| DataFrame Creation | `df = nw.from_native(df_pd)` | `df_pd = pd.DataFrame(data)` (without conversion) |
| Column Access | `nw.col("feature")` | `df["feature"]` or `df.feature` |
| Type Casting | `nw.col("x").cast(nw.Float64())` | `df["x"].astype("float64")` |
| Null Checking | `nw.col("x").is_null().sum()` | `df["x"].isnull().sum()` |
| Mean Imputation | `nw.col("x").fill_null(mean_val)` | `df["x"].fillna(df["x"].mean())` |
| String Operations | `nw.col("x").str.to_uppercase()` | `df["x"].str.upper()` |

We'll explore two fundamental ML workflow patterns:

1. **Feature Validation (Eager)**
   - Use `eager_only=True` for immediate validation results
   - Return Python types for ML pipeline decisions
   - Example: Checking numeric features before training

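2. **Feature Processing (Lazy-Compatible)**
   - Express transformation chains as Narwhals expressions
   - The default `@nw.narwhalify` accepts both eager and lazy frames; work is deferred only when the input is itself lazy (e.g., a Polars `LazyFrame`), and it is then the backend, not Narwhals, that optimizes the query
   - Example: Converting features to ML-ready format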

The examples show how to handle common ML scenarios (missing values, mixed types, inconsistent categories) using proper Narwhals patterns that work across any DataFrame backend.

## Backend-Agnostic DataFrame Creation

Narwhals provides a consistent interface across different DataFrame backends (Pandas, Polars, etc.). The key pattern for backend-agnostic code is:

1. Create your DataFrame with any supported backend (e.g., Pandas)
2. Convert to Narwhals format using `nw.from_native()`
3. Use Narwhals operations that work across all backends

This pattern ensures your code works regardless of the underlying DataFrame implementation.


In [1]:
import narwhals as nw
from narwhals.typing import FrameT  # Type hint for backend-agnostic DataFrames
import pandas as pd
import numpy as np
from typing import Dict, List, Optional, Union, Any

# Create sample data with common ML data issues
data = {
    # Numeric features with different representations
    'integer_feature': [1, 2, None, 4, 5],           # Has null
    'float_feature': [1.5, 2.5, 3.5, None, 5.5],    # Has null
    'string_number': ['1.0', '2.0', 'bad', '4.0', '5.0'],  # Has invalid value
    
    # Categorical features with inconsistencies
    'category_clean': ['A', 'B', 'A', 'C', 'B'],
    'category_messy': ['a', 'B', None, 'c', 'b'],   # Mixed case + null
    
    # Target variable
    'target': [0, 1, 1, 0, 1]                       # Binary classification
}

# Pattern: Backend-Agnostic Conversion
df_pd = pd.DataFrame(data)           # Create with any backend
df = nw.from_native(df_pd)          # Convert to Narwhals format
df

┌───────────────────────────────────────┐
| Narwhals DataFrame                    |
| Use `.to_native` to see native output |
└───────────────────────────────────────┘

## Pattern 1: Simple ML Type Validation Functions

In ML workflows, we commonly need to validate two types of features:
1. Numeric features (can be converted to float, handle nulls)
2. Categorical features (consistent categories, handle case sensitivity)

Let's create simple validation functions that work across any backend.

In [2]:
@nw.narwhalify(eager_only=True)
def validate_numeric_column(df: FrameT, column: str) -> Dict[str, Any]:
    """Validate if a column can be used as numeric feature.
    
    Common ML checks:
    - Can convert to float
    - Count of nulls
    - Basic statistics
    """
    try:
        # Try float conversion
        stats = df.select([
            nw.col(column)
               .cast(nw.Float64())
               .mean()
               .alias("mean"),
            nw.col(column)
               .is_null()
               .sum()
               .alias("nulls")
        ])
        
        return {
            "valid": True,
            "null_count": stats["nulls"].item(),
            "mean": stats["mean"].item()
        }
    except Exception as e:
        return {
            "valid": False,
            "error": str(e)
        }

@nw.narwhalify(eager_only=True)
def validate_categorical_column(df: FrameT, column: str) -> Dict[str, Any]:
    """Validate if a column can be used as categorical feature.
    
    Common ML checks:
    - Unique categories
    - Null handling
    - Case consistency
    """
    try:
        # Get stats using Narwhals operations only
        stats = df.select([
            nw.col(column)
               .cast(nw.String())
               .n_unique()
               .alias("unique"),
            nw.col(column)
               .is_null()
               .sum()
               .alias("nulls")
        ])
        
        # Get categories using Narwhals operations
        categories = df.select([
            nw.col(column)
               .cast(nw.String())
               .alias(column)
        ]).unique()
        
        return {
            "valid": True,
            "null_count": stats["nulls"].item(),
            "unique_count": stats["unique"].item(),
            "n_categories": categories.select([nw.col(column).count()]).item()
        }
    except Exception as e:
        return {
            "valid": False,
            "error": str(e)
        }

# Test with Pandas backend
print("Testing with Pandas backend:")
print("\nNumeric validation:")
print("integer_feature:", validate_numeric_column(df, "integer_feature"))
print("string_number:", validate_numeric_column(df, "string_number"))

print("\nCategorical validation:")
print("category_clean:", validate_categorical_column(df, "category_clean"))
print("category_messy:", validate_categorical_column(df, "category_messy"))


Testing with Pandas backend:

Numeric validation:
integer_feature: {'valid': True, 'null_count': np.int64(1), 'mean': np.float64(3.0)}
string_number: {'valid': False, 'error': "could not convert string to float: 'bad'"}

Categorical validation:
category_clean: {'valid': True, 'null_count': np.int64(0), 'unique_count': np.int64(3), 'n_categories': np.int64(3)}
category_messy: {'valid': True, 'null_count': np.int64(1), 'unique_count': np.int64(5), 'n_categories': np.int64(5)}


### The Same Functions with the Polars Backend

In [3]:
import polars as pl

# Create Polars DataFrame
df_pl = pl.DataFrame(data)
df_pl_nw = nw.from_native(df_pl)

print("Testing with Polars backend:")
print("\nNumeric validation:")
print("integer_feature:", validate_numeric_column(df_pl_nw, "integer_feature"))
print("string_number:", validate_numeric_column(df_pl_nw, "string_number"))

print("\nCategorical validation:")
print("category_clean:", validate_categorical_column(df_pl_nw, "category_clean"))
print("category_messy:", validate_categorical_column(df_pl_nw, "category_messy"))

Testing with Polars backend:

Numeric validation:
integer_feature: {'valid': True, 'null_count': 1, 'mean': 3.0}
string_number: {'valid': False, 'error': 'conversion from `str` to `f64` failed in column \'string_number\' for 1 out of 5 values: ["bad"]'}

Categorical validation:
category_clean: {'valid': True, 'null_count': 0, 'unique_count': 3, 'n_categories': 3}
category_messy: {'valid': True, 'null_count': 1, 'unique_count': 5, 'n_categories': 4}


## Results Analysis

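The validation functions run unchanged on both backends, but the results are not identical:

1. **Numeric Validation**
   - Both backends reject the invalid value ("bad" in `string_number`)
   - Error messages differ but convey the same information
   - Null counts and means agree in value, but pandas returns NumPy scalars (`np.int64`, `np.float64`) while Polars returns plain Python numbers

2. **Categorical Validation**
   - Both backends report the same null count and the same `unique_count` (5 for `category_messy`)
   - `n_categories` differs, because `count()` only counts non-null values:
     * pandas reports 5: after the cast to `nw.String()`, the missing value apparently behaves as a string rather than a null (consistent with the `NONE` entry in the pandas processing output later in this notebook)
     * Polars reports 4: the null stays null and is excluded
   - This backend difference is important to note for ML pipelines, since an extra category changes downstream encodings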

## Pattern 2: Feature Processing

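Now that we can validate features, let's look at processing them for ML. These functions use the default `@nw.narwhalify`, which accepts eager and lazy frames alike; work is deferred only when the input is itself lazy. Both inputs below are eager DataFrames, and `process_numeric_feature` calls `.item()`, which requires one.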

In [4]:
@nw.narwhalify
def process_numeric_feature(df: FrameT, column: str) -> FrameT:
    """Process a numeric feature for ML.
    
    Common ML transformations:
    - Convert to float
    - Fill nulls with mean
    - Standardize format
    """
    # Get mean first (.item() needs an eager DataFrame; a LazyFrame has no .item())
    mean_val = df.select([
        nw.col(column)
           .cast(nw.Float64())
           .mean()
    ]).item()
    
    # Then use it for filling nulls
    return df.select([
        nw.col(column)
           .cast(nw.Float64())
           .fill_null(mean_val)
           .alias(column)
    ])

@nw.narwhalify
def process_categorical_feature(df: FrameT, column: str) -> FrameT:
    """Process a categorical feature for ML.
    
    Common ML transformations:
    - Standardize case
    - Fill nulls with UNKNOWN
    - Consistent string format
    """
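    # NOTE: order matters. In the pandas output below the null comes out as
    # "NONE": it is apparently no longer null after the cast, so fill_null
    # has nothing left to fill on that backend.
    return df.select([
        nw.col(column)
           .cast(nw.String())
           .str.to_uppercase()
           .fill_null("UNKNOWN")
           .alias(column)
    ])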

# Test with both backends
print("Pandas backend:")
print("\nProcessing numeric feature:")
numeric_result = process_numeric_feature(df, "integer_feature")
print(numeric_result)

print("\nProcessing categorical feature:")
categorical_result = process_categorical_feature(df, "category_messy")
print(categorical_result)

print("\nPolars backend:")
print("\nProcessing numeric feature:")
numeric_result_pl = process_numeric_feature(df_pl_nw, "integer_feature")
print(numeric_result_pl)

print("\nProcessing categorical feature:")
categorical_result_pl = process_categorical_feature(df_pl_nw, "category_messy")
print(categorical_result_pl)

Pandas backend:

Processing numeric feature:
   integer_feature
0              1.0
1              2.0
2              3.0
3              4.0
4              5.0

Processing categorical feature:
  category_messy
0              A
1              B
2           NONE
3              C
4              B

Polars backend:

Processing numeric feature:
shape: (5, 1)
┌─────────────────┐
│ integer_feature │
│ ---             │
│ f64             │
╞═════════════════╡
│ 1.0             │
│ 2.0             │
│ 3.0             │
│ 4.0             │
│ 5.0             │
└─────────────────┘

Processing categorical feature:
shape: (5, 1)
┌────────────────┐
│ category_messy │
│ ---            │
│ str            │
╞════════════════╡
│ A              │
│ B              │
│ UNKNOWN        │
│ C              │
│ B              │
└────────────────┘


## Summary: Core Narwhals Patterns

| Pattern | When to Use | Why This Pattern | Example Use Cases |
|---------|-------------|------------------|-------------------|
| **Eager Validation** <br> (`eager_only=True`) | • Need immediate results <br> • Returning Python types <br> • Checking data quality | • Validation needs results now <br> • Can't defer error checking <br> • Must verify before processing | • Type compatibility checks <br> • Null value detection <br> • Category validation |
| **Lazy-Compatible Transformation** <br> (default `@nw.narwhalify`) | • Chaining operations <br> • Data transformations <br> • Feature engineering | • Same code accepts eager and lazy frames <br> • A lazy backend (e.g. Polars `LazyFrame`) can optimize the whole query <br> • Avoids materializing intermediate results on such backends | • Type conversions <br> • Missing value imputation <br> • Feature standardization |

### Typical Validation Workflows

1. **Data Quality Validation (Eager)**
   - Use when immediate validation results needed
   - Return Python types for pipeline decisions
   - Example: Checking numeric features before training

2. **Data Transformation (Lazy-Compatible)**
   - Use for transformation chains built from expressions
   - Let a lazy backend optimize the query when the input is lazy; eager inputs run immediately
   - Example: Converting features to ML-ready format

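These patterns help keep your data validation code:
- Portable across DataFrame backends, provided backend differences such as null handling are tested
- Explicit about evaluation: eager where a result is needed now, deferred where the backend supports it
- Focused: validation functions return results, processing functions return frames
- Clear in purpose, with each check or transformation doing one thing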
