# 06 - Custom Transformers

This notebook covers how to build your own transformers for Nebula pipelines.

| Part | Topic |
|------|-------|
| **1** | The Transformer Base Class |
| **2** | Narwhals-Native Transformers |
| **3** | Backend-Specific Transformers |
| **4** | Column Selection Helpers |
| **5** | Parameter Tracking & Descriptions |
| **6** | Duck-Typed Transformers |
| **6B** | Using `to_native` and `from_native` Pipeline Keywords |
| **7** | Best Practices |

In [1]:
import narwhals as nw
import polars as pl

from nebula import TransformerPipeline
from nebula.base import Transformer

In [2]:
orders = pl.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer": ["alice", "bob", "alice", "carol", "bob"],
    "amount": [150.0, 75.0, 200.0, 50.0, 300.0],
    "status": ["shipped", "pending", "shipped", "pending", "delivered"],
})
orders

order_id,customer,amount,status
i64,str,f64,str
1,"""alice""",150.0,"""shipped"""
2,"""bob""",75.0,"""pending"""
3,"""alice""",200.0,"""shipped"""
4,"""carol""",50.0,"""pending"""
5,"""bob""",300.0,"""delivered"""


---
## Part 1: The Transformer Base Class

All Nebula transformers inherit from `Transformer`. The base class provides:

- **Automatic routing** between backends (pandas/polars/spark)
- **Parameter tracking** for visualization and debugging
- **Column selection helpers** for flexible column matching
- **Description support** for documentation

### 1.1 The Routing Logic

When you call `transform(df)`, the base class routes based on:

```
Input: Narwhals DataFrame
├─ Has _transform_nw()? → Use it (nw → nw)
└─ No _transform_nw()? → Convert to native → _select_transform() → Convert back

Input: Native DataFrame (pandas/polars/spark)
├─ Has _transform_nw()? → Wrap in nw → _transform_nw() → Unwrap
└─ No _transform_nw()? → Use backend-specific method
```

**Key insight:** If you implement `_transform_nw()`, your transformer works with ALL backends automatically!
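
The four branches above can be sketched in plain Python. This is a toy model, not Nebula's actual code: `FakeNarwhalsFrame`, `route`, and `_transform_native` are hypothetical names standing in for the Narwhals wrapper, the base-class routing, and the backend-specific methods.

```python
class FakeNarwhalsFrame:
    """Hypothetical stand-in for a Narwhals-wrapped DataFrame."""

    def __init__(self, native):
        self.native = native


def route(transformer, df):
    """Mimic the four routing branches described above (illustrative only)."""
    is_nw = isinstance(df, FakeNarwhalsFrame)
    has_nw = hasattr(transformer, "_transform_nw")

    if is_nw and has_nw:
        return transformer._transform_nw(df)  # nw -> nw
    if is_nw:
        # No Narwhals method: unwrap, run the native method, wrap again
        return FakeNarwhalsFrame(transformer._transform_native(df.native))
    if has_nw:
        # Native input: wrap, run the Narwhals method, unwrap
        return transformer._transform_nw(FakeNarwhalsFrame(df)).native
    return transformer._transform_native(df)  # native -> native


class NwOnly:
    def _transform_nw(self, df):
        return FakeNarwhalsFrame(df.native + ["nw"])


class NativeOnly:
    def _transform_native(self, df):
        return df + ["native"]


print(route(NwOnly(), ["rows"]))                                # ['rows', 'nw']
print(route(NativeOnly(), FakeNarwhalsFrame(["rows"])).native)  # ['rows', 'native']
```

In every branch the caller gets back the same kind of object it passed in, which is what makes a single `_transform_nw()` usable from any backend.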

---
## Part 2: Narwhals-Native Transformers

The recommended approach: implement `_transform_nw()` using the Narwhals API.

### 2.1 Minimal Example

In [3]:
class AddProcessedFlag(Transformer):
    """Add a 'processed' column with value True."""
    
    def _transform_nw(self, df):
        return df.with_columns(nw.lit(True).alias("processed"))


# Test it
t = AddProcessedFlag()
result = t.transform(orders)
print(result)

shape: (5, 5)
┌──────────┬──────────┬────────┬───────────┬───────────┐
│ order_id ┆ customer ┆ amount ┆ status    ┆ processed │
│ ---      ┆ ---      ┆ ---    ┆ ---       ┆ ---       │
│ i64      ┆ str      ┆ f64    ┆ str       ┆ bool      │
╞══════════╪══════════╪════════╪═══════════╪═══════════╡
│ 1        ┆ alice    ┆ 150.0  ┆ shipped   ┆ true      │
│ 2        ┆ bob      ┆ 75.0   ┆ pending   ┆ true      │
│ 3        ┆ alice    ┆ 200.0  ┆ shipped   ┆ true      │
│ 4        ┆ carol    ┆ 50.0   ┆ pending   ┆ true      │
│ 5        ┆ bob      ┆ 300.0  ┆ delivered ┆ true      │
└──────────┴──────────┴────────┴───────────┴───────────┘


### 2.2 With Parameters

**Important:** Always use keyword-only arguments (`*`) for transformer parameters. This ensures compatibility with config-driven pipelines.

In [4]:
class MultiplyColumn(Transformer):
    """Multiply a column by a factor."""
    
    def __init__(self, *, column: str, factor: float, output_col: str | None = None):
        super().__init__()  # Always call super().__init__()
        self._column = column
        self._factor = factor
        self._output_col = output_col or column
    
    def _transform_nw(self, df):
        expr = (nw.col(self._column) * self._factor).alias(self._output_col)
        return df.with_columns(expr)


# Test it
t = MultiplyColumn(column="amount", factor=1.1, output_col="amount_with_tax")
result = t.transform(orders)
print(result)

shape: (5, 5)
┌──────────┬──────────┬────────┬───────────┬─────────────────┐
│ order_id ┆ customer ┆ amount ┆ status    ┆ amount_with_tax │
│ ---      ┆ ---      ┆ ---    ┆ ---       ┆ ---             │
│ i64      ┆ str      ┆ f64    ┆ str       ┆ f64             │
╞══════════╪══════════╪════════╪═══════════╪═════════════════╡
│ 1        ┆ alice    ┆ 150.0  ┆ shipped   ┆ 165.0           │
│ 2        ┆ bob      ┆ 75.0   ┆ pending   ┆ 82.5            │
│ 3        ┆ alice    ┆ 200.0  ┆ shipped   ┆ 220.0           │
│ 4        ┆ carol    ┆ 50.0   ┆ pending   ┆ 55.0            │
│ 5        ┆ bob      ┆ 300.0  ┆ delivered ┆ 330.0           │
└──────────┴──────────┴────────┴───────────┴─────────────────┘


### 2.3 Using in Pipelines

In [5]:
pipe = TransformerPipeline([
    MultiplyColumn(column="amount", factor=1.1, output_col="amount_with_tax"),
    AddProcessedFlag(),
])

pipe.show(add_params=True)
result = pipe.run(orders)
print(result)

2026-02-09 10:33:20,406 | [INFO]: Starting pipeline 
2026-02-09 10:33:20,407 | [INFO]: Running 'MultiplyColumn' ... 
2026-02-09 10:33:20,408 | [INFO]: Completed 'MultiplyColumn' in 0.0s 
2026-02-09 10:33:20,408 | [INFO]: Running 'AddProcessedFlag' ... 
2026-02-09 10:33:20,409 | [INFO]: Completed 'AddProcessedFlag' in 0.0s 
2026-02-09 10:33:20,409 | [INFO]: Pipeline completed in 0.0s 


*** Pipeline *** (2 transformations)
 - MultiplyColumn -> PARAMS: column="amount", factor=1.1, output_col="amount_with_tax"
 - AddProcessedFlag
shape: (5, 6)
┌──────────┬──────────┬────────┬───────────┬─────────────────┬───────────┐
│ order_id ┆ customer ┆ amount ┆ status    ┆ amount_with_tax ┆ processed │
│ ---      ┆ ---      ┆ ---    ┆ ---       ┆ ---             ┆ ---       │
│ i64      ┆ str      ┆ f64    ┆ str       ┆ f64             ┆ bool      │
╞══════════╪══════════╪════════╪═══════════╪═════════════════╪═══════════╡
│ 1        ┆ alice    ┆ 150.0  ┆ shipped   ┆ 165.0           ┆ true      │
│ 2        ┆ bob      ┆ 75.0   ┆ pending   ┆ 82.5            ┆ true      │
│ 3        ┆ alice    ┆ 200.0  ┆ shipped   ┆ 220.0           ┆ true      │
│ 4        ┆ carol    ┆ 50.0   ┆ pending   ┆ 55.0            ┆ true      │
│ 5        ┆ bob      ┆ 300.0  ┆ delivered ┆ 330.0           ┆ true      │
└──────────┴──────────┴────────┴───────────┴─────────────────┴───────────┘


---
## Part 3: Backend-Specific Transformers

When Narwhals doesn't support what you need, implement backend-specific methods:

- `_transform_pandas(self, df)` → Returns pandas DataFrame
- `_transform_polars(self, df)` → Returns polars DataFrame
- `_transform_spark(self, df)` → Returns Spark DataFrame

The base class routes automatically via `_select_transform()`.

In [6]:
class ReverseRows(Transformer):
    """Reverse row order - uses native APIs since Narwhals doesn't support this."""
    
    def _transform_pandas(self, df):
        return df.iloc[::-1].reset_index(drop=True)
    
    def _transform_polars(self, df):
        return df.reverse()
    
    def _transform_spark(self, df):
        from pyspark.sql import functions as F
        from pyspark.sql.window import Window
        
        # Add row number, sort descending, drop helper column
        w = Window.orderBy(F.monotonically_increasing_id())
        return (
            df.withColumn("_row_num", F.row_number().over(w))
            .orderBy(F.desc("_row_num"))
            .drop("_row_num")
        )


# Test with Polars
t = ReverseRows()
result = t.transform(orders)
print(result)

shape: (5, 4)
┌──────────┬──────────┬────────┬───────────┐
│ order_id ┆ customer ┆ amount ┆ status    │
│ ---      ┆ ---      ┆ ---    ┆ ---       │
│ i64      ┆ str      ┆ f64    ┆ str       │
╞══════════╪══════════╪════════╪═══════════╡
│ 5        ┆ bob      ┆ 300.0  ┆ delivered │
│ 4        ┆ carol    ┆ 50.0   ┆ pending   │
│ 3        ┆ alice    ┆ 200.0  ┆ shipped   │
│ 2        ┆ bob      ┆ 75.0   ┆ pending   │
│ 1        ┆ alice    ┆ 150.0  ┆ shipped   │
└──────────┴──────────┴────────┴───────────┘


In [7]:
# Also works with pandas
orders_pd = orders.to_pandas()
result_pd = t.transform(orders_pd)
print(result_pd)

   order_id customer  amount     status
0         5      bob   300.0  delivered
1         4    carol    50.0    pending
2         3    alice   200.0    shipped
3         2      bob    75.0    pending
4         1    alice   150.0    shipped
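
To make the backend routing less abstract, here is a minimal sketch of how a dispatcher could choose among `_transform_pandas`, `_transform_polars`, and `_transform_spark` by inspecting the module of the input type. It is an assumption about the general technique, not the real `_select_transform()`; `pick_backend_method` and the fake frame classes are hypothetical.

```python
# Hypothetical mapping from a DataFrame's top-level module to a method name.
_METHOD_BY_MODULE = {
    "pandas": "_transform_pandas",
    "polars": "_transform_polars",
    "pyspark": "_transform_spark",
}


def pick_backend_method(transformer, df):
    """Return the bound backend method matching df's type (sketch only)."""
    root = type(df).__module__.split(".")[0]  # "polars.dataframe.frame" -> "polars"
    name = _METHOD_BY_MODULE.get(root)
    if name is None:
        raise TypeError(f"Unsupported DataFrame type: {type(df).__name__}")
    method = getattr(transformer, name, None)
    if method is None:
        raise NotImplementedError(f"{type(transformer).__name__} lacks {name}()")
    return method


class PolarsOnly:
    def _transform_polars(self, df):
        return "polars path"


# Fake frame classes so the sketch runs without any DataFrame library installed.
FakePolarsFrame = type("DataFrame", (), {"__module__": "polars.dataframe.frame"})
FakePandasFrame = type("DataFrame", (), {"__module__": "pandas.core.frame"})

frame = FakePolarsFrame()
print(pick_backend_method(PolarsOnly(), frame)(frame))  # polars path
```

Under this sketch, a transformer that only defines `_transform_polars` raises `NotImplementedError` for a pandas input, so it is worth implementing every backend you expect to receive.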


### 3.1 Hybrid Approach

You can implement `_transform_nw()` for most backends and override `transform()` to special-case a specific one:

In [8]:
class DropDuplicates(Transformer):
    """Drop duplicate rows.
    
    Uses Narwhals for pandas/polars, but Spark needs special handling
    because it preserves ordering differently.
    """
    
    def __init__(self, *, columns: list[str] | None = None):
        super().__init__()
        self._columns = columns
    
    def transform(self, df):
        # Check if Spark - use native API
        df_type = type(df).__module__
        if "pyspark" in df_type:
            return self._transform_spark(df)
        
        # Otherwise use Narwhals
        if not isinstance(df, (nw.DataFrame, nw.LazyFrame)):
            df = nw.from_native(df)
            result = self._transform_nw(df)
            return nw.to_native(result)
        return self._transform_nw(df)
    
    def _transform_nw(self, df):
        if self._columns:
            return df.unique(subset=self._columns)
        return df.unique()
    
    def _transform_spark(self, df):
        if self._columns:
            return df.dropDuplicates(self._columns)
        return df.dropDuplicates()


# Test
t = DropDuplicates(columns=["customer"])
result = t.transform(orders)
print(result)

shape: (3, 4)
┌──────────┬──────────┬────────┬─────────┐
│ order_id ┆ customer ┆ amount ┆ status  │
│ ---      ┆ ---      ┆ ---    ┆ ---     │
│ i64      ┆ str      ┆ f64    ┆ str     │
╞══════════╪══════════╪════════╪═════════╡
│ 4        ┆ carol    ┆ 50.0   ┆ pending │
│ 1        ┆ alice    ┆ 150.0  ┆ shipped │
│ 2        ┆ bob      ┆ 75.0   ┆ pending │
└──────────┴──────────┴────────┴─────────┘


---
## Part 4: Column Selection Helpers

The base class provides helpers for flexible column selection (by name, regex, glob, prefix, suffix).

### 4.1 Using `_set_columns_selections()`

In [9]:
class UppercaseColumns(Transformer):
    """Convert string columns to uppercase."""
    
    def __init__(
        self,
        *,
        columns: str | list[str] | None = None,
        regex: str | None = None,
        glob: str | None = None,
        startswith: str | None = None,
        endswith: str | None = None,
    ):
        super().__init__()
        # Register column selection criteria
        self._set_columns_selections(
            columns=columns,
            regex=regex,
            glob=glob,
            startswith=startswith,
            endswith=endswith,
        )
    
    def _transform_nw(self, df):
        # Get matching columns at runtime
        selected = self._get_selected_columns(df)
        
        if not selected:
            return df
        
        exprs = [nw.col(c).str.to_uppercase() for c in selected]
        return df.with_columns(exprs)


# Test with explicit columns
t = UppercaseColumns(columns=["customer", "status"])
result = t.transform(orders)
print(result)

shape: (5, 4)
┌──────────┬──────────┬────────┬───────────┐
│ order_id ┆ customer ┆ amount ┆ status    │
│ ---      ┆ ---      ┆ ---    ┆ ---       │
│ i64      ┆ str      ┆ f64    ┆ str       │
╞══════════╪══════════╪════════╪═══════════╡
│ 1        ┆ ALICE    ┆ 150.0  ┆ SHIPPED   │
│ 2        ┆ BOB      ┆ 75.0   ┆ PENDING   │
│ 3        ┆ ALICE    ┆ 200.0  ┆ SHIPPED   │
│ 4        ┆ CAROL    ┆ 50.0   ┆ PENDING   │
│ 5        ┆ BOB      ┆ 300.0  ┆ DELIVERED │
└──────────┴──────────┴────────┴───────────┘


In [10]:
# Test with glob pattern
t = UppercaseColumns(glob="*er")  # Matches names ending in 'er' (here only 'customer')
result = t.transform(orders)
print(f"Selected columns: {t._get_selected_columns(orders)}")
print(result)

Selected columns: ['customer']
shape: (5, 4)
┌──────────┬──────────┬────────┬───────────┐
│ order_id ┆ customer ┆ amount ┆ status    │
│ ---      ┆ ---      ┆ ---    ┆ ---       │
│ i64      ┆ str      ┆ f64    ┆ str       │
╞══════════╪══════════╪════════╪═══════════╡
│ 1        ┆ ALICE    ┆ 150.0  ┆ shipped   │
│ 2        ┆ BOB      ┆ 75.0   ┆ pending   │
│ 3        ┆ ALICE    ┆ 200.0  ┆ shipped   │
│ 4        ┆ CAROL    ┆ 50.0   ┆ pending   │
│ 5        ┆ BOB      ┆ 300.0  ┆ delivered │
└──────────┴──────────┴────────┴───────────┘


### 4.2 Column Selection Options

| Parameter | Description | Example |
|-----------|-------------|---------|
| `columns` | Explicit list | `["col_a", "col_b"]` |
| `regex` | Regular expression | `"^amount_.*"` |
| `glob` | Shell-style pattern | `"*_id"` |
| `startswith` | Column prefix | `"order"` |
| `endswith` | Column suffix | `"_at"` |
| `allow_excess_columns` | Allow requested columns that are missing from the DataFrame | `True` |
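
The matching rules in the table can be approximated with the standard library. The sketch below assumes that criteria are combined as a union (a column is selected if it matches any of them) and that the DataFrame's column order is preserved; both are assumptions, and `select_columns` is a hypothetical helper rather than part of Nebula.

```python
import fnmatch
import re


def select_columns(available, *, columns=None, regex=None, glob=None,
                   startswith=None, endswith=None):
    """Return the names in `available` that match any criterion, in order."""
    if isinstance(columns, str):
        columns = [columns]
    explicit = set(columns or [])
    pattern = re.compile(regex) if regex else None

    def matches(name):
        return (
            name in explicit
            or (pattern is not None and pattern.search(name) is not None)
            or (glob is not None and fnmatch.fnmatchcase(name, glob))
            or (startswith is not None and name.startswith(startswith))
            or (endswith is not None and name.endswith(endswith))
        )

    return [name for name in available if matches(name)]


cols = ["order_id", "customer", "amount", "status"]
print(select_columns(cols, glob="*er"))                 # ['customer']
print(select_columns(cols, startswith="order"))         # ['order_id']
print(select_columns(cols, regex="^a", endswith="us"))  # ['amount', 'status']
```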

---
## Part 5: Parameter Tracking & Descriptions

### 5.1 Automatic Parameter Tracking

The `InitParamsStorage` metaclass automatically captures `__init__` parameters:

In [11]:
t = MultiplyColumn(column="amount", factor=1.1, output_col="new_amount")

# Parameters are automatically captured
print("Tracked parameters:")
print(t.transformer_init_parameters)

Tracked parameters:
{'column': 'amount', 'factor': 1.1, 'output_col': 'new_amount'}


In [12]:
# These appear in pipeline.show() and DAG visualization
pipe = TransformerPipeline([
    MultiplyColumn(column="amount", factor=1.1),
])
pipe.show(add_params=True)

*** Pipeline *** (1 transformation)
 - MultiplyColumn -> PARAMS: column="amount", factor=1.1
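
For readers curious how a metaclass can capture constructor arguments, the following is a minimal sketch of the general technique. It is not the `InitParamsStorage` implementation; `CaptureInitParams`, `Scale`, and `captured_params` are hypothetical names.

```python
import inspect


class CaptureInitParams(type):
    """Metaclass that records the arguments passed to __init__ (illustrative)."""

    def __call__(cls, *args, **kwargs):
        instance = super().__call__(*args, **kwargs)
        # Bind the call arguments to the __init__ signature to get them by name
        bound = inspect.signature(cls.__init__).bind(instance, *args, **kwargs)
        params = dict(bound.arguments)
        params.pop("self", None)
        instance.captured_params = params
        return instance


class Scale(metaclass=CaptureInitParams):
    def __init__(self, *, column, factor, output_col=None):
        self._column = column
        self._factor = factor
        self._output_col = output_col or column


print(Scale(column="amount", factor=1.1).captured_params)
# {'column': 'amount', 'factor': 1.1}
```

Only the arguments actually passed are recorded here, which mirrors the `pipe.show(add_params=True)` output above, where the unset `output_col` does not appear.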


### 5.2 Adding Descriptions

In [13]:
t = MultiplyColumn(column="amount", factor=1.1)
t.set_description("Apply 10% tax to order amounts")

print(f"Description: {t.get_description()}")

# In pipelines, you can also use tuple syntax:
pipe = TransformerPipeline([
    (MultiplyColumn(column="amount", factor=1.1), "Apply 10% tax to order amounts"),
])
pipe.show()

Description: Apply 10% tax to order amounts
*** Pipeline *** (1 transformation)
 - MultiplyColumn
     Description: Apply 10% tax to order amounts


---
## Part 6: Duck-Typed Transformers

You don't *have* to inherit from `Transformer`. Any object with a valid `transform(df)` method works.

### 6.1 Requirements

A duck-typed transformer must have:
- A `transform` method
- First parameter is positional (the DataFrame)
- Additional parameters must have defaults

In [14]:
class SimpleDuckTransformer:
    """No inheritance needed - just implement transform()."""
    
    def transform(self, df):
        # Works with any backend
        if not isinstance(df, (nw.DataFrame, nw.LazyFrame)):
            df_nw = nw.from_native(df)
            result = df_nw.with_columns(nw.lit("duck").alias("source"))
            return nw.to_native(result)
        return df.with_columns(nw.lit("duck").alias("source"))


# Use in pipeline
pipe = TransformerPipeline([
    SimpleDuckTransformer(),
])

result = pipe.run(orders)
print(result)

2026-02-09 10:33:20,475 | [INFO]: Starting pipeline 
2026-02-09 10:33:20,479 | [INFO]: Running 'SimpleDuckTransformer' ... 
2026-02-09 10:33:20,479 | [INFO]: Completed 'SimpleDuckTransformer' in 0.0s 
2026-02-09 10:33:20,481 | [INFO]: Pipeline completed in 0.0s 


shape: (5, 5)
┌──────────┬──────────┬────────┬───────────┬────────┐
│ order_id ┆ customer ┆ amount ┆ status    ┆ source │
│ ---      ┆ ---      ┆ ---    ┆ ---       ┆ ---    │
│ i64      ┆ str      ┆ f64    ┆ str       ┆ str    │
╞══════════╪══════════╪════════╪═══════════╪════════╡
│ 1        ┆ alice    ┆ 150.0  ┆ shipped   ┆ duck   │
│ 2        ┆ bob      ┆ 75.0   ┆ pending   ┆ duck   │
│ 3        ┆ alice    ┆ 200.0  ┆ shipped   ┆ duck   │
│ 4        ┆ carol    ┆ 50.0   ┆ pending   ┆ duck   │
│ 5        ┆ bob      ┆ 300.0  ┆ delivered ┆ duck   │
└──────────┴──────────┴────────┴───────────┴────────┘


In [15]:
class DuckWithOptions:
    """Duck-typed with optional parameters (must have defaults)."""
    
    def transform(self, df, prefix="col_", debug=False):
        if debug:
            print(f"Processing with prefix: {prefix}")
        # ... transformation logic
        return df


# Valid - extra params have defaults
pipe = TransformerPipeline([DuckWithOptions()])
pipe.show()

*** Pipeline *** (1 transformation)
 - DuckWithOptions


### 6.2 Validation

Nebula checks these requirements with `is_duck_typed_transformer`:

In [16]:
from nebula.pipelines.transformer_type_util import is_duck_typed_transformer


class ValidDuck:
    def transform(self, df): return df

class InvalidDuck:
    def transform(self): return None  # No df parameter!

class AlsoInvalid:
    def transform(self, df, required_param): return df  # No default!

print(f"ValidDuck: {is_duck_typed_transformer(ValidDuck())}")
print(f"InvalidDuck: {is_duck_typed_transformer(InvalidDuck())}")
print(f"AlsoInvalid: {is_duck_typed_transformer(AlsoInvalid())}")

ValidDuck: True
InvalidDuck: False
AlsoInvalid: False
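
A check like this can be written with `inspect.signature`. The sketch below is one possible reading of the three requirements from section 6.1, not Nebula's actual implementation; `looks_like_transformer` is a hypothetical name, and accepting `*args`/`**kwargs` as extra parameters is an assumption.

```python
import inspect


def looks_like_transformer(obj):
    """Check the duck-typing rules with inspect (sketch, not Nebula's code)."""
    method = getattr(obj, "transform", None)
    if not callable(method):
        return False
    # For a bound method, the signature no longer includes `self`
    params = list(inspect.signature(method).parameters.values())
    if not params:
        return False  # no parameter to receive the DataFrame
    first, *rest = params
    if first.kind not in (first.POSITIONAL_ONLY, first.POSITIONAL_OR_KEYWORD):
        return False
    # Assumption: extra parameters need a default, or must be *args / **kwargs
    return all(
        p.default is not p.empty or p.kind in (p.VAR_POSITIONAL, p.VAR_KEYWORD)
        for p in rest
    )


class Valid:
    def transform(self, df, prefix="col_"):
        return df


class MissingDf:
    def transform(self):
        return None


class RequiredExtra:
    def transform(self, df, required_param):
        return df


for candidate in (Valid(), MissingDf(), RequiredExtra(), object()):
    print(type(candidate).__name__, looks_like_transformer(candidate))
```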


---
## Part 6B: Using `to_native` and `from_native` Pipeline Keywords

When working with custom functions or libraries that expect **native** DataFrames (pandas, Polars, Spark) rather than Narwhals-wrapped DataFrames, you can use the `"to_native"` and `"from_native"` pipeline keywords.

### When to Use These Keywords

1. **Third-party libraries** that only work with native DataFrames (e.g., scikit-learn, plotly)
2. **Custom functions** that use backend-specific features not available in Narwhals
3. **Duck-typed transformers** that don't handle Narwhals wrapping internally

### The Keywords

| Keyword | Effect |
|---------|--------|
| `"to_native"` | Converts the DataFrame from Narwhals to native format (no-op if already native) |
| `"from_native"` | Converts the DataFrame from native to Narwhals format (no-op if already Narwhals) |
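
The no-op behaviour in the table can be illustrated with a toy pipeline runner. Everything in this sketch is hypothetical (`Wrapped`, `unwrap_if_needed`, `wrap_if_needed`, `run_steps`); it only shows how string keywords could be mapped to idempotent conversion steps, and is not how `TransformerPipeline` is implemented.

```python
class Wrapped:
    """Hypothetical stand-in for a Narwhals-wrapped frame."""

    def __init__(self, native):
        self.native = native


def unwrap_if_needed(df):
    """Return the native object; no-op when df is already native."""
    return df.native if isinstance(df, Wrapped) else df


def wrap_if_needed(df):
    """Return a wrapped object; no-op when df is already wrapped."""
    return df if isinstance(df, Wrapped) else Wrapped(df)


KEYWORDS = {"to_native": unwrap_if_needed, "from_native": wrap_if_needed}


def run_steps(df, steps):
    """Apply callables in order, resolving string keywords to conversions."""
    for step in steps:
        func = KEYWORDS[step] if isinstance(step, str) else step
        df = func(df)
    return df


def native_only(rows):
    """A step that only understands the native (list) representation."""
    if not isinstance(rows, list):
        raise TypeError("native_only expects a plain list")
    return rows + ["native step"]


# The repeated "to_native" is harmless because the conversion is idempotent
result = run_steps(
    Wrapped(["start"]),
    ["to_native", "to_native", native_only, "from_native"],
)
print(type(result).__name__, result.native)  # Wrapped ['start', 'native step']
```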

### 6B.1 Example: Using a Native-Only Custom Function

Suppose you have a custom function that uses Polars-specific features not available in Narwhals:

In [17]:
# A custom function that expects a native Polars DataFrame
# (uses the native Polars expression API: pl.concat_str and .hash())
def add_row_hash(df: pl.DataFrame) -> pl.DataFrame:
    """Add a hash column based on all row values - Polars-only feature."""
    return df.with_columns(
        pl.concat_str(pl.all()).hash().alias("row_hash")
    )


# WITHOUT conversion keywords - this may fail if the pipeline passes a Narwhals DataFrame
# pipe = TransformerPipeline([
#     add_row_hash,  # Would receive a Narwhals DataFrame - might fail!
# ])

# WITH conversion keywords - explicitly convert for the custom function
pipe = TransformerPipeline([
    AddProcessedFlag(),         # Works with Narwhals
    "to_native",                # Convert to native Polars
    add_row_hash,               # Now receives native Polars DataFrame
    "from_native",              # Convert back to Narwhals for remaining steps
    MultiplyColumn(column="amount", factor=1.1, output_col="amount_with_tax"),
])

pipe.show()
result = pipe.run(orders)
print(result)

2026-02-09 10:33:20,503 | [INFO]: Starting pipeline 
2026-02-09 10:33:20,503 | [INFO]: Running 'AddProcessedFlag' ... 
2026-02-09 10:33:20,503 | [INFO]: Completed 'AddProcessedFlag' in 0.0s 
2026-02-09 10:33:20,503 | [INFO]: Running 'add_row_hash' ... 
2026-02-09 10:33:20,503 | [INFO]: Completed 'add_row_hash' in 0.0s 
2026-02-09 10:33:20,503 | [INFO]: Running 'MultiplyColumn' ... 
2026-02-09 10:33:20,503 | [INFO]: Completed 'MultiplyColumn' in 0.0s 
2026-02-09 10:33:20,503 | [INFO]: Pipeline completed in 0.0s 


*** Pipeline *** (3 transformations)
 - AddProcessedFlag
   --> Convert to native DataFrame
 - add_row_hash
   --> Convert to Narwhals DataFrame
 - MultiplyColumn
shape: (5, 7)
┌──────────┬──────────┬────────┬───────────┬───────────┬──────────────────────┬─────────────────┐
│ order_id ┆ customer ┆ amount ┆ status    ┆ processed ┆ row_hash             ┆ amount_with_tax │
│ ---      ┆ ---      ┆ ---    ┆ ---       ┆ ---       ┆ ---                  ┆ ---             │
│ i64      ┆ str      ┆ f64    ┆ str       ┆ bool      ┆ u64                  ┆ f64             │
╞══════════╪══════════╪════════╪═══════════╪═══════════╪══════════════════════╪═════════════════╡
│ 1        ┆ alice    ┆ 150.0  ┆ shipped   ┆ true      ┆ 10649755137095888670 ┆ 165.0           │
│ 2        ┆ bob      ┆ 75.0   ┆ pending   ┆ true      ┆ 12761887061380610543 ┆ 82.5            │
│ 3        ┆ alice    ┆ 200.0  ┆ shipped   ┆ true      ┆ 15083726127477530648 ┆ 220.0           │
│ 4        ┆ carol    ┆ 50.0   ┆ pendin

### 6B.2 Example: Wrapping Multiple Native Functions

You can use the keywords to create "native zones" in your pipeline:

In [18]:
# Multiple native-only functions
def add_description(df: pl.DataFrame) -> pl.DataFrame:
    """Add a description column using Polars string interpolation."""
    return df.with_columns(
        pl.format("Order {} for {}", "order_id", "customer").alias("description")
    )

def flag_high_value(df: pl.DataFrame) -> pl.DataFrame:
    """Flag high-value orders using Polars when/then/otherwise."""
    return df.with_columns(
        pl.when(pl.col("amount") > 100)
        .then(pl.lit("high"))
        .otherwise(pl.lit("normal"))
        .alias("value_tier")
    )


# Create a "native zone" for multiple functions
pipe = TransformerPipeline([
    AddProcessedFlag(),         # Narwhals transformer
    "to_native",                # Enter native zone
    add_description,            # Native function 1
    flag_high_value,            # Native function 2  
    "from_native",              # Exit native zone
    MultiplyColumn(column="amount", factor=0.9, output_col="discounted"),  # Back to Narwhals
])

pipe.show()
result = pipe.run(orders)
print(result)

2026-02-09 10:33:20,518 | [INFO]: Starting pipeline 
2026-02-09 10:33:20,518 | [INFO]: Running 'AddProcessedFlag' ... 
2026-02-09 10:33:20,518 | [INFO]: Completed 'AddProcessedFlag' in 0.0s 
2026-02-09 10:33:20,518 | [INFO]: Running 'add_description' ... 
2026-02-09 10:33:20,518 | [INFO]: Completed 'add_description' in 0.0s 
2026-02-09 10:33:20,518 | [INFO]: Running 'flag_high_value' ... 
2026-02-09 10:33:20,518 | [INFO]: Completed 'flag_high_value' in 0.0s 
2026-02-09 10:33:20,518 | [INFO]: Running 'MultiplyColumn' ... 
2026-02-09 10:33:20,518 | [INFO]: Completed 'MultiplyColumn' in 0.0s 
2026-02-09 10:33:20,518 | [INFO]: Pipeline completed in 0.0s 


*** Pipeline *** (4 transformations)
 - AddProcessedFlag
   --> Convert to native DataFrame
 - add_description
 - flag_high_value
   --> Convert to Narwhals DataFrame
 - MultiplyColumn
shape: (5, 8)
┌──────────┬──────────┬────────┬───────────┬───────────┬───────────────────┬────────────┬────────────┐
│ order_id ┆ customer ┆ amount ┆ status    ┆ processed ┆ description       ┆ value_tier ┆ discounted │
│ ---      ┆ ---      ┆ ---    ┆ ---       ┆ ---       ┆ ---               ┆ ---        ┆ ---        │
│ i64      ┆ str      ┆ f64    ┆ str       ┆ bool      ┆ str               ┆ str        ┆ f64        │
╞══════════╪══════════╪════════╪═══════════╪═══════════╪═══════════════════╪════════════╪════════════╡
│ 1        ┆ alice    ┆ 150.0  ┆ shipped   ┆ true      ┆ Order 1 for alice ┆ high       ┆ 135.0      │
│ 2        ┆ bob      ┆ 75.0   ┆ pending   ┆ true      ┆ Order 2 for bob   ┆ normal     ┆ 67.5       │
│ 3        ┆ alice    ┆ 200.0  ┆ shipped   ┆ true      ┆ Order 3 for alice ┆ hig

### 6B.3 Safe Conversions (No-Ops)

Both keywords use "safe" conversion functions that are **no-ops** when the DataFrame is already in the target format:

- `"to_native"` on a native DataFrame → returns it unchanged
- `"from_native"` on a Narwhals DataFrame → returns it unchanged

This means you can use them defensively without worrying about double-conversions:

In [19]:
# You can also import the utility functions directly for use in your own code
from nebula.nw_util import safe_to_native, safe_from_native

# These are safe - no double conversion
native_df = orders  # Already native Polars
result1 = safe_to_native(native_df)  # No-op, returns same DataFrame
print(f"safe_to_native on native: {type(result1)}")

nw_df = nw.from_native(orders)
result2 = safe_from_native(nw_df)    # No-op, returns same DataFrame  
print(f"safe_from_native on Narwhals: {type(result2)}")

# Actual conversions
result3 = safe_to_native(nw_df)      # Converts to native
print(f"safe_to_native on Narwhals: {type(result3)}")

result4 = safe_from_native(native_df)  # Wraps in Narwhals
print(f"safe_from_native on native: {type(result4)}")

safe_to_native on native: <class 'polars.dataframe.frame.DataFrame'>
safe_from_native on Narwhals: <class 'narwhals.dataframe.DataFrame'>
safe_to_native on Narwhals: <class 'polars.dataframe.frame.DataFrame'>
safe_from_native on native: <class 'narwhals.dataframe.DataFrame'>


---
## Part 7: Best Practices

### 7.1 Always Call `super().__init__()`

In [20]:
class CorrectTransformer(Transformer):
    def __init__(self, *, value: int):
        super().__init__()  # ← Required for parameter tracking
        self._value = value

### 7.2 Use Keyword-Only Arguments

All parameters after `*` are keyword-only, ensuring config compatibility:

In [21]:
class GoodTransformer(Transformer):
    def __init__(self, *, column: str, value: int):  # ← keyword-only
        super().__init__()
        # ...

# Works with config:
# {"transformer": "GoodTransformer", "params": {"column": "x", "value": 1}}
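
To see why keyword-only parameters suit config-driven pipelines, consider this minimal sketch of a loader that unpacks a config entry into a constructor. The `REGISTRY` and `build_from_config` names are hypothetical, and the entry shape simply follows the comment above; Nebula's real loader may differ.

```python
class Scale:
    """Toy transformer-like class with keyword-only parameters."""

    def __init__(self, *, column, factor=1.0):
        self.column = column
        self.factor = factor


# Hypothetical name-to-class registry used to resolve config entries
REGISTRY = {"Scale": Scale}


def build_from_config(entry):
    """Instantiate a class from {"transformer": name, "params": {...}}."""
    cls = REGISTRY[entry["transformer"]]
    return cls(**entry.get("params", {}))  # params map one-to-one to keywords


step = build_from_config(
    {"transformer": "Scale", "params": {"column": "x", "factor": 2.0}}
)
print(step.column, step.factor)  # x 2.0

# Positional use is rejected, so argument order can never silently matter
try:
    Scale("x", 2.0)
except TypeError as exc:
    print(f"TypeError: {exc}")
```

Because every parameter must be named, a config file stays valid even if parameters are later reordered in the constructor signature.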

### 7.3 Prefer Narwhals for Multi-Backend Support

Use `_transform_nw()` whenever possible:

In [22]:
class PortableTransformer(Transformer):
    """Works with pandas, polars, and spark - no extra code!"""
    
    def __init__(self, *, column: str):
        super().__init__()
        self._column = column
    
    def _transform_nw(self, df):
        return df.filter(nw.col(self._column).is_not_null())

### 7.4 Type Hints for Clarity

In [23]:
from collections.abc import Iterable


class WellTypedTransformer(Transformer):
    def __init__(
        self,
        *,
        columns: str | list[str],
        threshold: float = 0.0,
        exclude: Iterable[str] | None = None,
    ):
        super().__init__()
        # ...

### 7.5 Validate in `__init__`, Not `transform`

Fail fast with clear errors:

In [24]:
class ValidatingTransformer(Transformer):
    def __init__(self, *, threshold: float):
        super().__init__()
        
        # Validate early
        if not 0.0 <= threshold <= 1.0:
            raise ValueError(f"threshold must be between 0 and 1, got {threshold}")
        
        self._threshold = threshold


try:
    t = ValidatingTransformer(threshold=1.5)
except ValueError as e:
    print(f"Caught: {e}")

Caught: threshold must be between 0 and 1, got 1.5


---
## Summary

| Pattern | When to Use |
|---------|-------------|
| `_transform_nw()` | **Default** - works across all backends |
| `_transform_pandas/polars/spark()` | Backend-specific features |
| `_set_columns_selections()` | Flexible column matching |
| Duck-typed | Quick prototypes, external libs |

**Checklist for new transformers:**
1. ✅ Inherit from `Transformer`
2. ✅ Call `super().__init__()`
3. ✅ Use keyword-only args (`*`)
4. ✅ Implement `_transform_nw()` if possible
5. ✅ Add type hints
6. ✅ Validate in `__init__`