# 05 - Configuration-Driven Pipelines

Nebula pipelines can be defined entirely in configuration (JSON/YAML), enabling:

- **Separation of concerns** - Data engineers define configs, code stays stable
- **Environment flexibility** - Different configs for dev/prod/regions
- **No code changes** - Adjust pipelines by editing configuration instead of redeploying code

| Part | Topic |
|------|-------|
| **1** | Basic Configuration |
| **2** | Transformer Options |
| **3** | Custom Transformers & Functions |
| **4** | Lazy Parameters (`__ns__`) |
| **5** | Loops (Dynamic Expansion) |
| **6** | Complete Examples |

In [1]:
import polars as pl

from nebula.base import Transformer
from nebula.pipelines.pipeline_loader import load_pipeline
from nebula.storage import nebula_storage as ns

In [2]:
orders = pl.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer": ["alice", "bob", "alice", "carol", "bob"],
    "amount": [150.0, 75.0, 200.0, 50.0, 300.0],
    "region": ["US", "EU", "US", "APAC", "EU"],
})
orders

| order_id | customer | amount | region |
|----------|----------|--------|--------|
| i64 | str | f64 | str |
| 1 | "alice" | 150.0 | "US" |
| 2 | "bob" | 75.0 | "EU" |
| 3 | "alice" | 200.0 | "US" |
| 4 | "carol" | 50.0 | "APAC" |
| 5 | "bob" | 300.0 | "EU" |


---
## Part 1: Basic Configuration

The `load_pipeline` function accepts a Python dict (which can come from JSON or YAML files).

### 1.1 Minimal Pipeline

In [3]:
config = {
    "pipeline": [
        {"transformer": "SelectColumns", "params": {"columns": ["order_id", "amount"]}},
        {"transformer": "AssertNotEmpty"},
    ]
}

pipe = load_pipeline(config)
pipe.show(add_params=True)

*** Pipeline *** (2 transformations)
 - SelectColumns -> PARAMS: columns=['order_id', 'amount']
 - AssertNotEmpty


In [4]:
result = pipe.run(orders)
print(result)

2026-02-09 09:50:33,034 | [INFO]: Starting pipeline 
2026-02-09 09:50:33,035 | [INFO]: Running 'SelectColumns' ... 
2026-02-09 09:50:33,043 | [INFO]: Completed 'SelectColumns' in 0.0s 
2026-02-09 09:50:33,044 | [INFO]: Running 'AssertNotEmpty' ... 
2026-02-09 09:50:33,044 | [INFO]: Completed 'AssertNotEmpty' in 0.0s 
2026-02-09 09:50:33,045 | [INFO]: Pipeline completed in 0.0s 


shape: (5, 2)
┌──────────┬────────┐
│ order_id ┆ amount │
│ ---      ┆ ---    │
│ i64      ┆ f64    │
╞══════════╪════════╡
│ 1        ┆ 150.0  │
│ 2        ┆ 75.0   │
│ 3        ┆ 200.0  │
│ 4        ┆ 50.0   │
│ 5        ┆ 300.0  │
└──────────┴────────┘


### 1.2 Loading from Files

In practice, configs come from files:

```python
import json
import yaml  # pip install pyyaml

# From JSON
with open("pipeline.json") as f:
    config = json.load(f)

# From YAML
with open("pipeline.yaml") as f:
    config = yaml.safe_load(f)

pipe = load_pipeline(config)
```
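
Because the config is plain data, JSON's `true`/`false`/`null` arrive in Python as `True`/`False`/`None`. The sketch below uses only the standard library to parse a JSON version of the minimal pipeline and compare it with the equivalent dict. The `skip` key on the second entry is added here purely to show the boolean mapping.

```python
import json

# The minimal pipeline written as JSON text.
# JSON uses lowercase true/false/null; json.loads maps them to True/False/None.
json_text = """
{
  "pipeline": [
    {"transformer": "SelectColumns", "params": {"columns": ["order_id", "amount"]}},
    {"transformer": "AssertNotEmpty", "skip": false}
  ]
}
"""

config_from_json = json.loads(json_text)

# The same structure written directly as a Python dict.
expected = {
    "pipeline": [
        {"transformer": "SelectColumns", "params": {"columns": ["order_id", "amount"]}},
        {"transformer": "AssertNotEmpty", "skip": False},
    ]
}

print(config_from_json == expected)  # True
```

The parsed dict can be passed to `load_pipeline` exactly like a hand-written one.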

### 1.3 Pipeline-Level Options

In [5]:
config = {
    "name": "Order Processing",
    "df_input_name": "Raw Orders",
    "df_output_name": "Processed Orders",
    "pipeline": [
        {"transformer": "Filter", "params": {
            "input_col": "amount",
            "perform": "keep",
            "operator": "gt",
            "value": 100
        }},
    ]
}

pipe = load_pipeline(config)
pipe.show()

*** Order Processing *** (1 transformation)
 - Filter


---
## Part 2: Transformer Options

Each transformer entry supports these keys:

| Key | Required | Description |
|-----|----------|-------------|
| `transformer` | Yes | Transformer class name |
| `params` | No | Dict of constructor parameters |
| `description` | No | Human-readable description |
| `lazy` | No | If `true`, wrap in `LazyWrapper` (see Part 4) |
| `skip` | No | If `true`, skip this transformer |
| `perform` | No | If `false`, skip this transformer |

In [6]:
config = {
    "pipeline": [
        {
            "transformer": "Filter",
            "description": "Keep only high-value orders",
            "params": {
                "input_col": "amount",
                "perform": "keep",
                "operator": "gt",
                "value": 100
            }
        },
        {
            "transformer": "DropColumns",
            "skip": True,  # This transformer is skipped
            "params": {"columns": "region"}
        },
        {
            "transformer": "SelectColumns",
            "perform": False,  # Same as skip: True
            "params": {"columns": ["order_id"]}
        },
    ]
}

pipe = load_pipeline(config)
pipe.show()

# Only Filter runs - DropColumns and SelectColumns are skipped
result = pipe.run(orders)
print(f"\nColumns: {result.columns}")

2026-02-09 09:50:33,056 | [INFO]: Starting pipeline 
2026-02-09 09:50:33,056 | [INFO]: Running 'Filter' ... 
2026-02-09 09:50:33,056 | [INFO]: Completed 'Filter' in 0.0s 
2026-02-09 09:50:33,056 | [INFO]: Pipeline completed in 0.0s 


*** Pipeline *** (1 transformation)
 - Filter
     Description: Keep only high-value orders

Columns: ['order_id', 'customer', 'amount', 'region']
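
The `skip` and `perform` rules from the table can be expressed as a small predicate. This is an illustrative sketch, not Nebula's implementation: `is_active` is a hypothetical helper, and it assumes an entry is disabled when either `skip` is true or `perform` is false. How Nebula treats an entry that sets both keys is not covered in this notebook.

```python
def is_active(entry: dict) -> bool:
    """Hypothetical helper: True if the entry would run under the table's rules."""
    skipped = entry.get("skip", False) is True
    disabled = entry.get("perform", True) is False
    return not (skipped or disabled)


entries = [
    {"transformer": "Filter"},
    {"transformer": "DropColumns", "skip": True},
    {"transformer": "SelectColumns", "perform": False},
]

active = [entry["transformer"] for entry in entries if is_active(entry)]
print(active)  # ['Filter']
```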


### 2.1 Feature Flags with Skip/Perform

Drive `skip`/`perform` from environment variables or config flags to enable transformers conditionally. The cell below simulates the flags with Python constants:

In [7]:
# Simulate feature flags
ENABLE_EXPENSIVE_VALIDATION = False
ENABLE_DISCOUNT = True

config = {
    "pipeline": [
        {
            "transformer": "AssertCount",
            "description": "Expensive row count validation",
            "perform": ENABLE_EXPENSIVE_VALIDATION,
            "params": {"min_count": 1}
        },
        {
            "transformer": "AddLiterals",
            "description": "Add discount flag",
            "perform": ENABLE_DISCOUNT,
            "params": {
                "data": [{"alias": "has_discount", "value": True}]
            }
        },
    ]
}

pipe = load_pipeline(config)
pipe.show()

*** Pipeline *** (1 transformation)
 - AddLiterals
     Description: Add discount flag
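
The flags above are hard-coded constants. A minimal sketch of reading them from environment variables instead is shown below. `env_flag` is a hypothetical helper, the variable names are assumptions, and the set of accepted truthy strings is an arbitrary choice.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Hypothetical helper: read a boolean flag from an environment variable."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Set here only so the example is self-contained; normally the shell or
# the deployment environment would provide these values.
os.environ["ENABLE_DISCOUNT"] = "true"
os.environ.pop("ENABLE_EXPENSIVE_VALIDATION", None)

config = {
    "pipeline": [
        {
            "transformer": "AssertCount",
            "perform": env_flag("ENABLE_EXPENSIVE_VALIDATION"),
            "params": {"min_count": 1},
        },
        {
            "transformer": "AddLiterals",
            "perform": env_flag("ENABLE_DISCOUNT"),
            "params": {"data": [{"alias": "has_discount", "value": True}]},
        },
    ]
}

print([step["perform"] for step in config["pipeline"]])  # [False, True]
```

Resolving the flags to real booleans before calling `load_pipeline` keeps the config itself free of environment-specific logic.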


### 2.2 Storage Keywords in Config

A `{"store": "key"}` entry placed between transformers saves the DataFrame at that point to nebula storage under the given key:

In [8]:
ns.clear()

config = {
    "pipeline": [
        {"transformer": "Filter", "params": {
            "input_col": "amount", "perform": "keep", "operator": "gt", "value": 100
        }},
        {"store": "filtered_orders"},
        {"transformer": "SelectColumns", "params": {"columns": ["order_id", "amount"]}},
    ]
}

pipe = load_pipeline(config)
pipe.show()

result = pipe.run(orders)
print(f"\nStored keys: {ns.list_keys()}")

2026-02-09 09:50:33,066 | [INFO]: Nebula Storage: clear. 
2026-02-09 09:50:33,066 | [INFO]: Nebula Storage: 0 keys remained after clearing. 
2026-02-09 09:50:33,066 | [INFO]: Starting pipeline 
2026-02-09 09:50:33,066 | [INFO]: Running 'Filter' ... 
2026-02-09 09:50:33,066 | [INFO]: Completed 'Filter' in 0.0s 
2026-02-09 09:50:33,066 | [INFO]:    --> Store df with key "filtered_orders" 
2026-02-09 09:50:33,066 | [INFO]: Nebula Storage: setting an object (<class 'polars.dataframe.frame.DataFrame'>) with the key "filtered_orders". 
2026-02-09 09:50:33,066 | [INFO]: Running 'SelectColumns' ... 
2026-02-09 09:50:33,066 | [INFO]: Completed 'SelectColumns' in 0.0s 
2026-02-09 09:50:33,066 | [INFO]: Pipeline completed in 0.0s 


*** Pipeline *** (2 transformations)
 - Filter
   --> Store df with key "filtered_orders"
 - SelectColumns

Stored keys: ['filtered_orders']


---
## Part 3: Custom Transformers & Functions

To use transformers not in Nebula's core, pass them via `extra_transformers`. For split/branch functions, use `extra_functions`.

**Best practice:** Use `dict[str, T]` where the **key is the contract** with the config. This is refactor-safe - you can rename Python classes/functions without breaking configs.

### 3.1 Extra Transformers

In [9]:
# Define custom transformers
class AddProcessedFlag(Transformer):
    def _transform_nw(self, df):
        import narwhals as nw
        return df.with_columns(nw.lit(True).alias("processed"))


class DoubleAmount(Transformer):
    def _transform_nw(self, df):
        import narwhals as nw
        return df.with_columns((nw.col("amount") * 2).alias("amount"))


# Config references transformers by name (the dict key)
config = {
    "pipeline": [
        {"transformer": "AddProcessedFlag"},
        {"transformer": "DoubleAmount"},
    ]
}

# Pass custom transformers as dict[str, class]
# Keys are the names used in config - decoupled from Python class names
pipe = load_pipeline(
    config,
    extra_transformers={
        "AddProcessedFlag": AddProcessedFlag,
        "DoubleAmount": DoubleAmount,
    }
)

result = pipe.run(orders.head(2))
print(result)

2026-02-09 09:50:33,080 | [INFO]: Starting pipeline 
2026-02-09 09:50:33,081 | [INFO]: Running 'AddProcessedFlag' ... 
2026-02-09 09:50:33,081 | [INFO]: Completed 'AddProcessedFlag' in 0.0s 
2026-02-09 09:50:33,084 | [INFO]: Running 'DoubleAmount' ... 
2026-02-09 09:50:33,084 | [INFO]: Completed 'DoubleAmount' in 0.0s 
2026-02-09 09:50:33,084 | [INFO]: Pipeline completed in 0.0s 


shape: (2, 5)
┌──────────┬──────────┬────────┬────────┬───────────┐
│ order_id ┆ customer ┆ amount ┆ region ┆ processed │
│ ---      ┆ ---      ┆ ---    ┆ ---    ┆ ---       │
│ i64      ┆ str      ┆ f64    ┆ str    ┆ bool      │
╞══════════╪══════════╪════════╪════════╪═══════════╡
│ 1        ┆ alice    ┆ 300.0  ┆ US     ┆ true      │
│ 2        ┆ bob      ┆ 150.0  ┆ EU     ┆ true      │
└──────────┴──────────┴────────┴────────┴───────────┘


**Why dict keys matter for refactoring:**

```python
# If you rename DoubleAmount → MultiplyAmount in Python:
extra_transformers={
    "DoubleAmount": MultiplyAmount,  # Key stays same, config doesn't break
}
```

### 3.2 Using Modules

For production, organize transformers in modules:

```python
# my_transformers.py
from nebula.base import Transformer

# __all__ lists the transformer classes this module exposes
__all__ = ["AddProcessedFlag", "DoubleAmount"]

class AddProcessedFlag(Transformer): ...
class DoubleAmount(Transformer): ...
```

```python
# Load from module
import my_transformers

pipe = load_pipeline(config, extra_transformers=[my_transformers])
```
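
If you want to keep the explicit dict contract from section 3.1 while still organizing classes in a module, one option is to build the dict from the module's `__all__`. The sketch below is an assumption about how you might do that yourself, not something Nebula provides: `registry_from_module` is a hypothetical helper, and the module is created in memory with placeholder classes so the example runs anywhere.

```python
import types


def registry_from_module(module: types.ModuleType) -> dict:
    """Hypothetical helper: map each name in the module's __all__ to its object."""
    return {name: getattr(module, name) for name in module.__all__}


# Placeholder classes standing in for real Transformer subclasses.
class AddProcessedFlag: ...
class DoubleAmount: ...


# In-memory stand-in for my_transformers.py.
demo_module = types.ModuleType("my_transformers")
demo_module.AddProcessedFlag = AddProcessedFlag
demo_module.DoubleAmount = DoubleAmount
demo_module.__all__ = ["AddProcessedFlag", "DoubleAmount"]

registry = registry_from_module(demo_module)
print(sorted(registry))  # ['AddProcessedFlag', 'DoubleAmount']
```

The resulting dict has the same `dict[str, class]` shape as the one passed to `extra_transformers` in section 3.1.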

### 3.3 Extra Functions (for Split/Branch)

In [10]:
def split_by_region(df):
    """Split DataFrame by region."""
    return {
        "us": df.filter(pl.col("region") == "US"),
        "international": df.filter(pl.col("region") != "US"),
    }


config = {
    "split_function": "split_by_region",  # References dict key
    "pipeline": {
        "us": [
            {"transformer": "AddLiterals", "params": {
                "data": [{"alias": "tax_rate", "value": 0.08}]
            }}
        ],
        "international": [
            {"transformer": "AddLiterals", "params": {
                "data": [{"alias": "tax_rate", "value": 0.20}]
            }}
        ],
    }
}

pipe = load_pipeline(
    config,
    extra_functions={
        "split_by_region": split_by_region,  # Key matches config reference
    }
)

pipe.show()
result = pipe.run(orders)
print(result)

2026-02-09 09:50:33,089 | [INFO]: Starting pipeline 
2026-02-09 09:50:33,089 | [INFO]: Entering split 
2026-02-09 09:50:33,097 | [INFO]: Running 'AddLiterals' ... 
2026-02-09 09:50:33,097 | [INFO]: Completed 'AddLiterals' in 0.0s 
2026-02-09 09:50:33,097 | [INFO]: Running 'AddLiterals' ... 
2026-02-09 09:50:33,097 | [INFO]: Completed 'AddLiterals' in 0.0s 
2026-02-09 09:50:33,097 | [INFO]: Pipeline completed in 0.0s 


*** Pipeline *** (2 transformations)
------ SPLIT ------ (function: split_by_region)
**SPLIT <<< international >>> (1 transformation):
     - AddLiterals
**SPLIT <<< us >>> (1 transformation):
     - AddLiterals
<<< Append DFs >>>
shape: (5, 5)
┌──────────┬──────────┬────────┬────────┬──────────┐
│ order_id ┆ customer ┆ amount ┆ region ┆ tax_rate │
│ ---      ┆ ---      ┆ ---    ┆ ---    ┆ ---      │
│ i64      ┆ str      ┆ f64    ┆ str    ┆ f64      │
╞══════════╪══════════╪════════╪════════╪══════════╡
│ 2        ┆ bob      ┆ 75.0   ┆ EU     ┆ 0.2      │
│ 4        ┆ carol    ┆ 50.0   ┆ APAC   ┆ 0.2      │
│ 5        ┆ bob      ┆ 300.0  ┆ EU     ┆ 0.2      │
│ 1        ┆ alice    ┆ 150.0  ┆ US     ┆ 0.08     │
│ 3        ┆ alice    ┆ 200.0  ┆ US     ┆ 0.08     │
└──────────┴──────────┴────────┴────────┴──────────┘


---
## Part 4: Lazy Parameters (`__ns__`)

For runtime parameter resolution, use `lazy: true` with the `__ns__` prefix to reference values from nebula storage.

See notebook 04 for `LazyWrapper` details.

In [11]:
ns.clear()
ns.set("computed_label", "premium_customer")

config = {
    "pipeline": [
        {
            "transformer": "AddLiterals",
            "lazy": True,  # Enable lazy resolution
            "params": {
                "data": [
                    {"alias": "static_col", "value": "hardcoded"},
                    {"alias": "dynamic_col", "value": "__ns__computed_label"},  # Resolved at runtime
                ]
            }
        }
    ]
}

pipe = load_pipeline(config)
result = pipe.run(orders.head(2))
print(result)

2026-02-09 09:50:33,105 | [INFO]: Nebula Storage: clear. 
2026-02-09 09:50:33,105 | [INFO]: Nebula Storage: 0 keys remained after clearing. 
2026-02-09 09:50:33,105 | [INFO]: Nebula Storage: setting an object (<class 'str'>) with the key "computed_label". 
2026-02-09 09:50:33,105 | [INFO]: Starting pipeline 
2026-02-09 09:50:33,105 | [INFO]: Running '(Lazy) AddLiterals' ... 
2026-02-09 09:50:33,105 | [INFO]: Completed '(Lazy) AddLiterals' in 0.0s 
2026-02-09 09:50:33,105 | [INFO]: Pipeline completed in 0.0s 


shape: (2, 6)
┌──────────┬──────────┬────────┬────────┬────────────┬──────────────────┐
│ order_id ┆ customer ┆ amount ┆ region ┆ static_col ┆ dynamic_col      │
│ ---      ┆ ---      ┆ ---    ┆ ---    ┆ ---        ┆ ---              │
│ i64      ┆ str      ┆ f64    ┆ str    ┆ str        ┆ str              │
╞══════════╪══════════╪════════╪════════╪════════════╪══════════════════╡
│ 1        ┆ alice    ┆ 150.0  ┆ US     ┆ hardcoded  ┆ premium_customer │
│ 2        ┆ bob      ┆ 75.0   ┆ EU     ┆ hardcoded  ┆ premium_customer │
└──────────┴──────────┴────────┴────────┴────────────┴──────────────────┘


The `__ns__` prefix works at any nesting depth within `params`.
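
To make "any nesting depth" concrete, here is a rough sketch of recursive resolution over nested dicts and lists. It is an illustration only, not Nebula's implementation: `resolve_lazy` is a hypothetical function and a plain dict stands in for nebula storage.

```python
NS_PREFIX = "__ns__"


def resolve_lazy(value, storage: dict):
    """Recursively replace "__ns__<key>" strings with storage[<key>]."""
    if isinstance(value, str) and value.startswith(NS_PREFIX):
        return storage[value[len(NS_PREFIX):]]
    if isinstance(value, dict):
        return {key: resolve_lazy(item, storage) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_lazy(item, storage) for item in value]
    return value  # numbers, booleans, and ordinary strings pass through


storage = {"computed_label": "premium_customer"}  # stand-in for nebula storage

params = {
    "data": [
        {"alias": "static_col", "value": "hardcoded"},
        {"alias": "dynamic_col", "value": "__ns__computed_label"},
    ]
}

resolved = resolve_lazy(params, storage)
print(resolved["data"][1]["value"])  # premium_customer
```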

---
## Part 5: Loops (Dynamic Expansion)

Loops let you generate repetitive pipeline sections dynamically. The `loop` block expands at load time, creating multiple transformers or pipelines from a template.

### 5.1 Basic Loop

In [12]:
config = {
    "pipeline": [
        {
            "loop": {
                "values": {
                    "col_name": ["flag_a", "flag_b", "flag_c"]
                },
                "transformer": "AddLiterals",
                "params": {
                    "data": [{"alias": "<<col_name>>", "value": True}]
                }
            }
        }
    ]
}

pipe = load_pipeline(config)
pipe.show(add_params=True)

*** Pipeline *** (3 transformations)
 - AddLiterals -> PARAMS: data=[{'alias': 'flag_a', 'value': True}]
 - AddLiterals -> PARAMS: data=[{'alias': 'flag_b', 'value': True}]
 - AddLiterals -> PARAMS: data=[{'alias': 'flag_c', 'value': True}]


In [13]:
result = pipe.run(orders.head(2))
print(result)

2026-02-09 09:50:33,120 | [INFO]: Starting pipeline 
2026-02-09 09:50:33,120 | [INFO]: Running 'AddLiterals' ... 
2026-02-09 09:50:33,120 | [INFO]: Completed 'AddLiterals' in 0.0s 
2026-02-09 09:50:33,120 | [INFO]: Running 'AddLiterals' ... 
2026-02-09 09:50:33,120 | [INFO]: Completed 'AddLiterals' in 0.0s 
2026-02-09 09:50:33,120 | [INFO]: Running 'AddLiterals' ... 
2026-02-09 09:50:33,120 | [INFO]: Completed 'AddLiterals' in 0.0s 
2026-02-09 09:50:33,120 | [INFO]: Pipeline completed in 0.0s 


shape: (2, 7)
┌──────────┬──────────┬────────┬────────┬────────┬────────┬────────┐
│ order_id ┆ customer ┆ amount ┆ region ┆ flag_a ┆ flag_b ┆ flag_c │
│ ---      ┆ ---      ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    │
│ i64      ┆ str      ┆ f64    ┆ str    ┆ bool   ┆ bool   ┆ bool   │
╞══════════╪══════════╪════════╪════════╪════════╪════════╪════════╡
│ 1        ┆ alice    ┆ 150.0  ┆ US     ┆ true   ┆ true   ┆ true   │
│ 2        ┆ bob      ┆ 75.0   ┆ EU     ┆ true   ┆ true   ┆ true   │
└──────────┴──────────┴────────┴────────┴────────┴────────┴────────┘


### 5.2 Loop Modes: Linear vs Product

| Mode | Description |
|------|-------------|
| `linear` | Zip values together (default) |
| `product` | Cartesian product of all values |
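
The two modes correspond to Python's `zip` and `itertools.product`. The sketch below shows how a `values` block could be turned into one binding per generated transformer. `expand_values` is a hypothetical helper, and what Nebula does when the lists in `linear` mode have different lengths is not covered here (plain `zip` silently truncates to the shortest list).

```python
from itertools import product


def expand_values(values: dict, mode: str = "linear") -> list:
    """Hypothetical helper: one {name: value} binding per loop iteration."""
    names = list(values)
    if mode == "linear":
        combos = zip(*values.values())
    elif mode == "product":
        combos = product(*values.values())
    else:
        raise ValueError(f"Unknown loop mode: {mode!r}")
    return [dict(zip(names, combo)) for combo in combos]


linear = expand_values({"name": ["col_x", "col_y"], "val": [100, 200]}, "linear")
print(linear)
# [{'name': 'col_x', 'val': 100}, {'name': 'col_y', 'val': 200}]

cartesian = expand_values(
    {"prefix": ["US", "EU"], "metric": ["sales", "returns"]}, "product"
)
print(len(cartesian))  # 4
```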

In [14]:
# Linear mode: zip(["a", "b"], [1, 2]) → [("a", 1), ("b", 2)]
config = {
    "pipeline": [
        {
            "loop": {
                "mode": "linear",
                "values": {
                    "name": ["col_x", "col_y"],
                    "val": [100, 200]
                },
                "transformer": "AddLiterals",
                "params": {
                    "data": [{"alias": "<<name>>", "value": "<<val>>"}]
                }
            }
        }
    ]
}

pipe = load_pipeline(config)
pipe.show(add_params=True)
# Creates: AddLiterals(col_x=100), AddLiterals(col_y=200)

*** Pipeline *** (2 transformations)
 - AddLiterals -> PARAMS: data=[{'alias': 'col_x', 'value': 100}]
 - AddLiterals -> PARAMS: data=[{'alias': 'col_y', 'value': 200}]


In [15]:
# Product mode: product(["a", "b"], [1, 2]) → [("a", 1), ("a", 2), ("b", 1), ("b", 2)]
config = {
    "pipeline": [
        {
            "loop": {
                "mode": "product",
                "values": {
                    "prefix": ["US", "EU"],
                    "metric": ["sales", "returns"]
                },
                "transformer": "AddLiterals",
                "params": {
                    "data": [{"alias": "<<prefix>>_<<metric>>", "value": 0}]
                }
            }
        }
    ]
}

pipe = load_pipeline(config)
pipe.show(add_params=True)
# Creates 4 transformers: US_sales, US_returns, EU_sales, EU_returns

*** Pipeline *** (4 transformations)
 - AddLiterals -> PARAMS: data=[{'alias': 'US_sales', 'value': 0}]
 - AddLiterals -> PARAMS: data=[{'alias': 'US_returns', 'value': 0}]
 - AddLiterals -> PARAMS: data=[{'alias': 'EU_sales', 'value': 0}]
 - AddLiterals -> PARAMS: data=[{'alias': 'EU_returns', 'value': 0}]
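
In the outputs above, `"<<val>>"` becomes the integer `100`, while `"<<prefix>>_<<metric>>"` becomes a string such as `'US_sales'`. One way to get that behaviour is to keep the bound value's type when a placeholder is the whole string and fall back to text interpolation otherwise. The sketch below is an assumption about the mechanism, not Nebula's code, and `substitute` is a hypothetical function.

```python
import re

PLACEHOLDER = re.compile(r"<<(\w+)>>")


def substitute(template, binding: dict):
    """Fill <<name>> placeholders in nested dicts, lists, and strings."""
    if isinstance(template, str):
        whole = PLACEHOLDER.fullmatch(template)
        if whole:
            # The string is exactly one placeholder: keep the value's type.
            return binding[whole.group(1)]
        # Placeholder(s) embedded in text: interpolate as strings.
        return PLACEHOLDER.sub(lambda m: str(binding[m.group(1)]), template)
    if isinstance(template, dict):
        return {key: substitute(item, binding) for key, item in template.items()}
    if isinstance(template, list):
        return [substitute(item, binding) for item in template]
    return template


typed = substitute(
    {"data": [{"alias": "<<name>>", "value": "<<val>>"}]},
    {"name": "col_x", "val": 100},
)
print(typed)  # {'data': [{'alias': 'col_x', 'value': 100}]}

joined = substitute("<<prefix>>_<<metric>>", {"prefix": "US", "metric": "sales"})
print(joined)  # US_sales
```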


### 5.3 Loop Over Pipelines

Loops can generate entire pipelines (with branches):

In [16]:
ns.clear()

config = {
    "pipeline": [
        {"store": "original"},  # Store for branch to read
        {
            "loop": {
                "mode": "linear",
                "values": {
                    "col_name": ["metric_a", "metric_b"],
                    "col_value": [10, 20]
                },
                "branch": {
                    "storage": "original",
                    "end": "join",
                    "on": "order_id",
                    "how": "left"
                },
                "pipeline": [
                    {"transformer": "SelectColumns", "params": {"columns": ["order_id"]}},
                    {"transformer": "AddLiterals", "params": {
                        "data": [{"alias": "<<col_name>>", "value": "<<col_value>>"}]
                    }}
                ]
            }
        }
    ]
}

pipe = load_pipeline(config)
pipe.show(add_params=True)

2026-02-09 09:50:33,140 | [INFO]: Nebula Storage: clear. 
2026-02-09 09:50:33,140 | [INFO]: Nebula Storage: 0 keys remained after clearing. 


*** Pipeline *** (4 transformations)
   --> Store df with key "original"
*** Pipeline *** (2 transformations)
------ BRANCH (from storage: original) ------
>> Branch (2 transformations):
     - SelectColumns -> PARAMS: columns=['order_id']
     - AddLiterals -> PARAMS: data=[{'alias': 'metric_a', 'value': 10}]
<<< Join DFs >>>
  - how: left
  - on: order_id
*** Pipeline *** (2 transformations)
------ BRANCH (from storage: original) ------
>> Branch (2 transformations):
     - SelectColumns -> PARAMS: columns=['order_id']
     - AddLiterals -> PARAMS: data=[{'alias': 'metric_b', 'value': 20}]
<<< Join DFs >>>
  - how: left
  - on: order_id


In [17]:
result = pipe.run(orders.head(3))
print(result)

2026-02-09 09:50:33,147 | [INFO]: Starting pipeline 
2026-02-09 09:50:33,147 | [INFO]:    --> Store df with key "original" 
2026-02-09 09:50:33,147 | [INFO]: Nebula Storage: setting an object (<class 'polars.dataframe.frame.DataFrame'>) with the key "original". 
2026-02-09 09:50:33,147 | [INFO]: Entering branch 
2026-02-09 09:50:33,147 | [INFO]: Running 'SelectColumns' ... 
2026-02-09 09:50:33,147 | [INFO]: Completed 'SelectColumns' in 0.0s 
2026-02-09 09:50:33,147 | [INFO]: Running 'AddLiterals' ... 
2026-02-09 09:50:33,147 | [INFO]: Completed 'AddLiterals' in 0.0s 
2026-02-09 09:50:33,147 | [INFO]: Entering branch 
2026-02-09 09:50:33,147 | [INFO]: Running 'SelectColumns' ... 
2026-02-09 09:50:33,147 | [INFO]: Completed 'SelectColumns' in 0.0s 
2026-02-09 09:50:33,147 | [INFO]: Running 'AddLiterals' ... 
2026-02-09 09:50:33,147 | [INFO]: Completed 'AddLiterals' in 0.0s 
2026-02-09 09:50:33,147 | [INFO]: Pipeline completed in 0.0s 


shape: (3, 6)
┌──────────┬──────────┬────────┬────────┬──────────┬──────────┐
│ order_id ┆ customer ┆ amount ┆ region ┆ metric_a ┆ metric_b │
│ ---      ┆ ---      ┆ ---    ┆ ---    ┆ ---      ┆ ---      │
│ i64      ┆ str      ┆ f64    ┆ str    ┆ i32      ┆ i32      │
╞══════════╪══════════╪════════╪════════╪══════════╪══════════╡
│ 1        ┆ alice    ┆ 150.0  ┆ US     ┆ 10       ┆ 20       │
│ 2        ┆ bob      ┆ 75.0   ┆ EU     ┆ 10       ┆ 20       │
│ 3        ┆ alice    ┆ 200.0  ┆ US     ┆ 10       ┆ 20       │
└──────────┴──────────┴────────┴────────┴──────────┴──────────┘


### 5.4 Skip/Perform with Loops

In [18]:
config = {
    "pipeline": [
        {
            "skip": True,  # Entire loop is skipped
            "loop": {
                "values": {"x": [1, 2, 3]},
                "transformer": "AssertNotEmpty"  # Never runs
            }
        },
        {"transformer": "SelectColumns", "params": {"glob": "*"}}
    ]
}

pipe = load_pipeline(config)
pipe.show()  # Only SelectColumns appears

*** Pipeline *** (1 transformation)
 - SelectColumns


---
## Part 6: Complete Examples

### 6.1 Split Pipeline in Config

In [19]:
def split_by_value(df):
    return {
        "high": df.filter(pl.col("amount") >= 150),
        "low": df.filter(pl.col("amount") < 150),
    }


config = {
    "name": "Value-Based Processing",
    "split_function": "split_by_value",
    "split_order": ["high", "low"],
    "pipeline": {
        "high": [
            {"transformer": "AddLiterals", "params": {
                "data": [{"alias": "priority", "value": "high"}]
            }}
        ],
        "low": [
            {"transformer": "AddLiterals", "params": {
                "data": [{"alias": "priority", "value": "standard"}]
            }}
        ]
    }
}

pipe = load_pipeline(config, extra_functions={"split_by_value": split_by_value})
pipe.show()

result = pipe.run(orders)
print(result)

2026-02-09 09:50:33,166 | [INFO]: Starting pipeline 'Value-Based Processing' 
2026-02-09 09:50:33,166 | [INFO]: Entering split 
2026-02-09 09:50:33,166 | [INFO]: Running 'AddLiterals' ... 
2026-02-09 09:50:33,173 | [INFO]: Completed 'AddLiterals' in 0.0s 
2026-02-09 09:50:33,173 | [INFO]: Running 'AddLiterals' ... 
2026-02-09 09:50:33,173 | [INFO]: Completed 'AddLiterals' in 0.0s 
2026-02-09 09:50:33,173 | [INFO]: Pipeline 'Value-Based Processing' completed in 0.0s 


*** Value-Based Processing *** (2 transformations)
------ SPLIT ------ (function: split_by_value)
**SPLIT <<< high >>> (1 transformation):
     - AddLiterals
**SPLIT <<< low >>> (1 transformation):
     - AddLiterals
<<< Append DFs >>>
shape: (5, 5)
┌──────────┬──────────┬────────┬────────┬──────────┐
│ order_id ┆ customer ┆ amount ┆ region ┆ priority │
│ ---      ┆ ---      ┆ ---    ┆ ---    ┆ ---      │
│ i64      ┆ str      ┆ f64    ┆ str    ┆ str      │
╞══════════╪══════════╪════════╪════════╪══════════╡
│ 1        ┆ alice    ┆ 150.0  ┆ US     ┆ high     │
│ 3        ┆ alice    ┆ 200.0  ┆ US     ┆ high     │
│ 5        ┆ bob      ┆ 300.0  ┆ EU     ┆ high     │
│ 2        ┆ bob      ┆ 75.0   ┆ EU     ┆ standard │
│ 4        ┆ carol    ┆ 50.0   ┆ APAC   ┆ standard │
└──────────┴──────────┴────────┴────────┴──────────┘


### 6.2 Branch Pipeline in Config

In [20]:
ns.clear()

# Pre-store customer tiers
tiers = pl.DataFrame({
    "customer": ["alice", "bob", "carol"],
    "tier": ["gold", "silver", "bronze"]
})
ns.set("customer_tiers", tiers)

config = {
    "name": "Enrich with Customer Tier",
    "branch": {
        "storage": "customer_tiers",
        "end": "join",
        "on": "customer",
        "how": "left"
    },
    "pipeline": [
        {"transformer": "SelectColumns", "params": {"columns": ["customer", "tier"]}}
    ]
}

pipe = load_pipeline(config)
pipe.show()

result = pipe.run(orders)
print(result)

2026-02-09 09:50:33,180 | [INFO]: Nebula Storage: clear. 
2026-02-09 09:50:33,180 | [INFO]: Nebula Storage: 0 keys remained after clearing. 
2026-02-09 09:50:33,180 | [INFO]: Nebula Storage: setting an object (<class 'polars.dataframe.frame.DataFrame'>) with the key "customer_tiers". 
2026-02-09 09:50:33,180 | [INFO]: Starting pipeline 'Enrich with Customer Tier' 
2026-02-09 09:50:33,180 | [INFO]: Entering branch 
2026-02-09 09:50:33,180 | [INFO]: Running 'SelectColumns' ... 
2026-02-09 09:50:33,180 | [INFO]: Completed 'SelectColumns' in 0.0s 
2026-02-09 09:50:33,180 | [INFO]: Pipeline 'Enrich with Customer Tier' completed in 0.0s 


*** Enrich with Customer Tier *** (1 transformation)
------ BRANCH (from storage: customer_tiers) ------
>> Branch (1 transformation):
     - SelectColumns
<<< Join DFs >>>
shape: (5, 5)
┌──────────┬──────────┬────────┬────────┬────────┐
│ order_id ┆ customer ┆ amount ┆ region ┆ tier   │
│ ---      ┆ ---      ┆ ---    ┆ ---    ┆ ---    │
│ i64      ┆ str      ┆ f64    ┆ str    ┆ str    │
╞══════════╪══════════╪════════╪════════╪════════╡
│ 1        ┆ alice    ┆ 150.0  ┆ US     ┆ gold   │
│ 2        ┆ bob      ┆ 75.0   ┆ EU     ┆ silver │
│ 3        ┆ alice    ┆ 200.0  ┆ US     ┆ gold   │
│ 4        ┆ carol    ┆ 50.0   ┆ APAC   ┆ bronze │
│ 5        ┆ bob      ┆ 300.0  ┆ EU     ┆ silver │
└──────────┴──────────┴────────┴────────┴────────┘


### 6.3 Apply-to-Rows in Config

In [21]:
config = {
    "name": "Discount High-Value Orders",
    "apply_to_rows": {
        "input_col": "amount",
        "operator": "ge",
        "value": 200
    },
    "pipeline": [
        {"transformer": "AddLiterals", "params": {
            "data": [{"alias": "discount", "value": 0.10}]
        }}
    ],
    "otherwise": [
        {"transformer": "AddLiterals", "params": {
            "data": [{"alias": "discount", "value": 0.0}]
        }}
    ]
}

pipe = load_pipeline(config)
pipe.show()

result = pipe.run(orders)
print(result)

2026-02-09 09:50:33,191 | [INFO]: Starting pipeline 'Discount High-Value Orders' 
2026-02-09 09:50:33,195 | [INFO]: Entering apply_to_rows 
2026-02-09 09:50:33,196 | [INFO]: Running 'AddLiterals' ... 
2026-02-09 09:50:33,197 | [INFO]: Completed 'AddLiterals' in 0.0s 
2026-02-09 09:50:33,198 | [INFO]: Running 'AddLiterals' ... 
2026-02-09 09:50:33,199 | [INFO]: Completed 'AddLiterals' in 0.0s 
2026-02-09 09:50:33,200 | [INFO]: Pipeline 'Discount High-Value Orders' completed in 0.0s 


*** Discount High-Value Orders *** (2 transformations)
------ APPLY TO ROWS (amount ge 200) ------
>> Apply To rows (1 transformation):
     - AddLiterals
>> Otherwise (1 transformation)
     - AddLiterals
<<< Append DFs >>>
shape: (5, 5)
┌──────────┬──────────┬────────┬────────┬──────────┐
│ order_id ┆ customer ┆ amount ┆ region ┆ discount │
│ ---      ┆ ---      ┆ ---    ┆ ---    ┆ ---      │
│ i64      ┆ str      ┆ f64    ┆ str    ┆ f64      │
╞══════════╪══════════╪════════╪════════╪══════════╡
│ 3        ┆ alice    ┆ 200.0  ┆ US     ┆ 0.1      │
│ 5        ┆ bob      ┆ 300.0  ┆ EU     ┆ 0.1      │
│ 1        ┆ alice    ┆ 150.0  ┆ US     ┆ 0.0      │
│ 2        ┆ bob      ┆ 75.0   ┆ EU     ┆ 0.0      │
│ 4        ┆ carol    ┆ 50.0   ┆ APAC   ┆ 0.0      │
└──────────┴──────────┴────────┴────────┴──────────┘
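
Conceptually, the result above looks like a partition-apply-append: rows matching the condition go through `pipeline`, the rest go through `otherwise`, and the two parts are appended in that order. The sketch below reproduces that row order with plain Python dicts. It is an illustration under stated assumptions rather than Nebula's implementation: `apply_to_rows` here is a hypothetical function, and the operator table lists only a few names mapped to Python's `operator` module.

```python
import operator

# Assumed mapping from config operator names to Python comparisons.
OPERATORS = {"gt": operator.gt, "ge": operator.ge, "lt": operator.lt, "le": operator.le}


def apply_to_rows(rows, input_col, op_name, value, then, otherwise):
    """Apply `then` to matching rows and `otherwise` to the rest, then append."""
    compare = OPERATORS[op_name]
    matched = [row for row in rows if compare(row[input_col], value)]
    rest = [row for row in rows if not compare(row[input_col], value)]
    return [then(row) for row in matched] + [otherwise(row) for row in rest]


rows = [
    {"order_id": 1, "amount": 150.0},
    {"order_id": 2, "amount": 75.0},
    {"order_id": 3, "amount": 200.0},
    {"order_id": 4, "amount": 50.0},
    {"order_id": 5, "amount": 300.0},
]

result = apply_to_rows(
    rows, "amount", "ge", 200,
    then=lambda row: {**row, "discount": 0.10},
    otherwise=lambda row: {**row, "discount": 0.0},
)

print([row["order_id"] for row in result])  # [3, 5, 1, 2, 4]
```

Note that the original row order is not preserved: matched rows come first, as in the notebook output.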


---
## Summary

| Feature | Syntax |
|---------|--------|
| Basic transformer | `{"transformer": "Name", "params": {...}}` |
| Skip/disable | `"skip": true` or `"perform": false` |
| Description | `"description": "Human-readable text"` |
| Lazy params | `"lazy": true` + `"__ns__key"` in params |
| Storage | `{"store": "key"}`, `{"store_debug": "key"}` |
| Loops | `{"loop": {"values": {...}, "transformer": ...}}` |
| Custom transformers | `extra_transformers={"Name": Class}` |
| Custom functions | `extra_functions={"name": func}` |

**Best practices:**
- Use `dict[str, T]` for extras (refactor-safe)
- Keep config keys stable, even if Python names change
- Use `skip`/`perform` for feature flags
- Organize custom transformers in modules with `__all__`

In [22]:
ns.clear()

2026-02-09 09:50:33,207 | [INFO]: Nebula Storage: clear. 
2026-02-09 09:50:33,207 | [INFO]: Nebula Storage: 0 keys remained after clearing. 
